Convertmsa and msa2profile for 3di alignments? #452

EvanKomp · 2025-04-16T16:12:35Z

Hey all - I want to start a search with a seed alignment of known active homologs. I created the aa and 3di MSAs with foldmason.

In mmseqs, I would create a profile db from the starting MSA and run the search on that; in foldseek, however, I am not sure how. I tried foldseek convertmsa and then foldseek msa2profile on the 3di alignment, but when I run the search it says it cannot find the _ss suffix. I assume this is because a database created this way skips the additional foldseek steps, on top of mmseqs, that createdb performs on a set of PDBs.

Any help is very much appreciated. I think a 3di alignment from such a workflow would be an excellent way to visualize the structural space of a certain type of protein with e.g. a VAE.

martin-steinegger · 2025-04-17T06:28:42Z

It is not super easy yet to convert an external MSA to a profile. @sooyoung-cha, did you try to get external MSAs into profiles?
Something that works out of the box would be to use internal MSAs:

foldseek search query target aln tmp -a 
foldseek result2profile query target aln prof 
foldseek search prof target aln tmp

In order to use an external MSA you could try (I have not tested this though):

foldseek convertmsa inputAA.fa msa
foldseek convertmsa input3Di.fa msa_ss
foldseek msa2profile msa profile

Here the MSAs need to be formatted exactly the same way, meaning the same sequence order and the same gaps per sequence.
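
The layout requirement above (same sequence order, same gaps per sequence) can be checked before running convertmsa. The following is a minimal sketch, not part of Foldseek: `gap_mask` and `same_layout` are hypothetical helper names, and the check assumes plain aligned FASTA input in which `-` is the only gap character.

```shell
#!/bin/sh
# Hypothetical helpers (not Foldseek commands).

# Print one line per record: "header<TAB>mask", where every non-gap
# character of the aligned sequence is replaced by "x" and gaps stay "-".
# Sequences wrapped over several lines are joined first.
gap_mask() {
    awk '
        function emit() {
            if (name != "") {
                gsub(/[^-]/, "x", seq)
                print name "\t" seq
            }
        }
        /^>/ { emit(); name = substr($0, 2); seq = ""; next }
        { seq = seq $0 }
        END { emit() }
    ' "$1"
}

# Succeed only if both files have the same headers in the same order
# and identical gap positions in every sequence.
same_layout() {
    [ "$(gap_mask "$1")" = "$(gap_mask "$2")" ]
}
```

For example, `same_layout inputAA.fa input3Di.fa && echo "layouts match"` prints the message only when the two alignments line up column by column.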

EvanKomp · 2025-04-23T17:40:57Z

@martin-steinegger Thanks for the reply.

The below got me to the point of a search that runs without error:

### convert alignment to msa database
foldseek convertmsa fold_align_aa.fa query/query_msa
foldseek convertmsa fold_align_3di.fa query/query_msa_ss

### convert msa database to profile database
foldseek msa2profile query/query_msa query/query_profile
foldseek msa2profile query/query_msa_ss query/query_profile_ss

### run search
foldseek search query/query_profile /datasets/UniprotKB/afdb ./search ./tmpSearch -c 0.6 --cov-mode 2 --threads 104 --alignment-type 0  --split-memory-limit 180G

However when unpacking the result:

### convert and unpack with result2msa and unpackdb
foldseek result2msa query/query_profile /datasets/UniprotKB/afdb search hit_msa --msa-format-mode 6
foldseek unpackdb hit_msa hit_msa_unpacked --unpack-suffix .a3m --unpack-name-mode 0

0.a3m is a big ol alignment but only contains the first sequence in the original msa. I assume this is user error, but it sort of seems like the search only used a single query sequence even though the query db was a profile one. Any ideas and thanks in advance?

EDIT: I hypothesized that maybe this was because of default parameters of msa2profile removing sequences from the MSA before computing the profile. I instead created both aa and ss profiles by foldseek msa2profile query/query_msa query/query_profile --match-mode 1 --match-ratio 0.8 --filter-msa 0, but now the search dies:

foldseek search query/query_profile /datasets/UniprotKB/afdb ./search ./tmpSearch --threads 104 --alignment-type 0  --split-memory-limit 180G -s 2 -a
Create directory ./tmpSearch
search query/query_profile /datasets/UniprotKB/afdb ./search ./tmpSearch --threads 104 --alignment-type 0 --split-memory-limit 180G -s 2 -a 

MMseqs Version:                         10.941cd33
Seq. id. threshold                      0
Coverage threshold                      0
Coverage mode                           0
Max reject                              2147483647
Max accept                              2147483647
Add backtrace                           true
TMscore threshold                       0
TMscore threshold mode                  0
TMalign hit order                       0
TMalign fast                            1
Preload mode                            0
Threads                                 104
Verbosity                               3
LDDT threshold                          0
Sort by structure bit score             1
Alignment type                          0
Exact TMscore                           0
Substitution matrix                     aa:3di.out,nucl:3di.out
Alignment mode                          3
Alignment mode                          0
E-value threshold                       10
Min alignment length                    0
Seq. id. mode                           0
Alternative alignments                  0
Max sequence length                     65535
Compositional bias                      1
Compositional bias                      1
Gap open cost                           aa:10,nucl:10
Gap extension cost                      aa:1,nucl:1
Compressed                              0
Seed substitution matrix                aa:3di.out,nucl:3di.out
Sensitivity                             2
k-mer length                            0
Target search mode                      0
k-score                                 seq:2147483647,prof:2147483647
Max results per query                   1000
Split database                          0
Split mode                              2
Split memory limit                      180G
Diagonal scoring                        true
Exact k-mer matching                    0
Mask residues                           0
Mask residues probability               0.999995
Mask lower case residues                1
Mask lower letter repeating N times     6
Minimum diagonal score                  30
Selected taxa                      
Spaced k-mers                           1
Spaced k-mer pattern               
Local temporary path               
Use GPU                                 0
Use GPU server                          0
Wait for GPU server                     600
Prefilter mode                          0
Exhaustive search mode                  false
Search iterations                       1
Remove temporary files                  true
MPI runner                         
Force restart with latest tmp           false
Cluster search                          0

prefilter query/query_profile_ss /datasets/UniprotKB/afdb_ss ./tmpSearch/16735541491874290893/pref --sub-mat 'aa:3di.out,nucl:3di.out' --seed-sub-mat 'aa:3di.out,nucl:3di.out' -s 2 -k 0 --target-search-mode 0 --k-score seq:2147483647,prof:2147483647 --alph-size aa:21,nucl:5 --max-seq-len 65535 --max-seqs 1000 --split 0 --split-mode 2 --split-memory-limit 180G -c 0 --cov-mode 0 --comp-bias-corr 1 --comp-bias-corr-scale 0.15 --diag-score 1 --exact-kmer-matching 0 --mask 0 --mask-prob 0.999995 --mask-lower-case 1 --mask-n-repeat 6 --min-ungapped-score 30 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --threads 104 --compressed 0 -v 3 

Query database size: 1 type: Profile
Target split mode. Searching through 11 splits
Estimated memory consumption: 155G
Target database size: 214683829 type: Aminoacid
Process prefiltering step 1 of 11

Index table k-mer threshold: 0 at k-mer size 7 
Index table: counting k-mers
[=================================================================] 100.00% 19.56M 12s 735ms    
Index table: Masked residues: 728815011
Index table: fill
[=================================================================] 100.00% 19.56M 20s 502ms       
Index statistics
Entries:          5063035714
DB size:          38736 MB
Avg k-mer size:   3.955497
Top 10 k-mers
    VLVLVVV     3259658
    SVSVVVV     3116512
    LVVVVVV     2850604
    VVVVVVV     2705558
    VSVVVVV     1942075
    SSVVVVV     1524001
    VVVVCVV     1378131
    SSVSVVV     1368735
    SNVVVVV     1288116
    DPSVVVV     994572
Time for index table init: 0h 0m 47s 250ms
k-mer similarity threshold: 135
Starting prefiltering scores calculation (step 1 of 11)
Query db start 1 to 1
Target db start 1 to 19555737
[=================================================================] 100.00% 1 eta -

2596.707965 k-mers per position
59422155 DB matches per sequence
1 overflows
128 sequences passed prefiltering per query sequence
128 median result list length
0 sequences with 0 size result lists
Time for merging to pref_tmp_0: 0h 0m 0s 2ms
Time for merging to pref_tmp_0_tmp: 0h 0m 0s 2ms
Process prefiltering step 2 of 11

Index table k-mer threshold: 0 at k-mer size 7 
Index table: counting k-mers
[=================================================================] 100.00% 19.40M 10s 571ms    
Index table: Masked residues: 765704783
Index table: fill
[=================================================================] 100.00% 19.40M 20s 279ms       
Index statistics
Entries:          5022494436
DB size:          38504 MB
Avg k-mer size:   3.923824
Top 10 k-mers
    VLVLVVV     3246333
    SVSVVVV     3097487
    LVVVVVV     2837946
    VVVVVVV     2713764
    VSVVVVV     1943484
    SSVVVVV     1511240
    VSVNVVV     1405579
    SSVSVVV     1355942
    SNVVVVV     1276593
    DPSVVVV     988214
Time for index table init: 0h 0m 46s 635ms
k-mer similarity threshold: 135
Starting prefiltering scores calculation (step 2 of 11)
Query db start 1 to 1
Target db start 19555738 to 38959624
[=================================================================] 100.00% 1 eta -

2596.707965 k-mers per position
58364017 DB matches per sequence
1 overflows
128 sequences passed prefiltering per query sequence
128 median result list length
0 sequences with 0 size result lists
Time for merging to pref_tmp_1: 0h 0m 0s 2ms
Time for merging to pref_tmp_1_tmp: 0h 0m 0s 1ms
Process prefiltering step 3 of 11

Index table k-mer threshold: 0 at k-mer size 7 
Index table: counting k-mers
[=================================================================] 100.00% 19.42M 10s 362ms    
Index table: Masked residues: 768096778
Index table: fill
[=================================================================] 100.00% 19.42M 17s 791ms    
Index statistics
Entries:          5019242848
DB size:          38485 MB
Avg k-mer size:   3.921283
Top 10 k-mers
    VLVLVVV     3239441
    SVSVVVV     3084889
    LVVVVVV     2832356
    VVVVVVV     2709768
    VSVVVVV     1934780
    DPVVVVV     1714739
    SSVVVVV     1507868
    SSVSVVV     1348844
    SNVVVVV     1275517
    DPSVVVV     983814
Time for index table init: 0h 0m 41s 96ms
k-mer similarity threshold: 135
Starting prefiltering scores calculation (step 3 of 11)
Query db start 1 to 1
Target db start 38959625 to 58376168
[=================================================================] 100.00% 1 eta -

2596.707965 k-mers per position
58258151 DB matches per sequence
1 overflows
128 sequences passed prefiltering per query sequence
128 median result list length
0 sequences with 0 size result lists
Time for merging to pref_tmp_2: 0h 0m 0s 2ms
Time for merging to pref_tmp_2_tmp: 0h 0m 0s 2ms
Process prefiltering step 4 of 11

Index table k-mer threshold: 0 at k-mer size 7 
Index table: counting k-mers
[=================================================================] 100.00% 19.50M 10s 228ms    
Index table: Masked residues: 745687563
Index table: fill
[=================================================================] 100.00% 19.50M 21s 287ms       
Index statistics
Entries:          5044971177
DB size:          38633 MB
Avg k-mer size:   3.941384
Top 10 k-mers
    VLVLVVV     3253205
    SVSVVVV     3106731
    LVVVVVV     2843608
    VVVVVVV     2705200
    VSVVVVV     1939666
    SSVVVVV     1517270
    VVVVCVV     1380649
    SSVSVVV     1361537
    SNVVVVV     1284169
    DPSVVVV     991381
Time for index table init: 0h 0m 46s 137ms
k-mer similarity threshold: 135
Starting prefiltering scores calculation (step 4 of 11)
Query db start 1 to 1
Target db start 58376169 to 77880679
[=================================================================] 100.00% 1 eta -

2596.707965 k-mers per position
59061544 DB matches per sequence
1 overflows
128 sequences passed prefiltering per query sequence
128 median result list length
0 sequences with 0 size result lists
Time for merging to pref_tmp_3: 0h 0m 0s 2ms
Time for merging to pref_tmp_3_tmp: 0h 0m 0s 1ms
Process prefiltering step 5 of 11

Index table k-mer threshold: 0 at k-mer size 7 
Index table: counting k-mers
[=================================================================] 100.00% 19.60M 9s 917ms     
Index table: Masked residues: 742844995
Index table: fill
[=================================================================] 100.00% 19.60M 17s 779ms    
Index statistics
Entries:          5046317092
DB size:          38640 MB
Avg k-mer size:   3.942435
Top 10 k-mers
    VLVLVVV     3269420
    SVSVVVV     3111766
    LVVVVVV     2855704
    VVVVVVV     2722385
    VSVVVVV     1948106
    SSVVVVV     1518315
    VVVVCVV     1382643
    SSVSVVV     1357458
    SNVVVVV     1288378
    DPSVVVV     992054
Time for index table init: 0h 0m 41s 557ms
k-mer similarity threshold: 135
Starting prefiltering scores calculation (step 5 of 11)
Query db start 1 to 1
Target db start 77880680 to 97476863
[=================================================================] 100.00% 1 eta -

2596.707965 k-mers per position
59018457 DB matches per sequence
1 overflows
128 sequences passed prefiltering per query sequence
128 median result list length
0 sequences with 0 size result lists
Time for merging to pref_tmp_4: 0h 0m 0s 2ms
Time for merging to pref_tmp_4_tmp: 0h 0m 0s 1ms
Process prefiltering step 6 of 11

Index table k-mer threshold: 0 at k-mer size 7 
Index table: counting k-mers
[=================================================================] 100.00% 19.51M 10s 614ms   
Index table: Masked residues: 757250387
Index table: fill
[=================================================================] 100.00% 19.51M 18s 75ms     
Index statistics
Entries:          5030573380
DB size:          38550 MB
Avg k-mer size:   3.930135
Top 10 k-mers
    VLVLVVV     3260308
    SVSVVVV     3096683
    LVVVVVV     2853667
    VVVVVVV     2717024
    VSVVVVV     1933661
    SSVVVVV     1512613
    VVVVCVV     1377403
    SSVSVVV     1353638
    SNVVVVV     1280451
    DPSVVVV     993383
Time for index table init: 0h 0m 41s 391ms
k-mer similarity threshold: 135
Starting prefiltering scores calculation (step 6 of 11)
Query db start 1 to 1
Target db start 97476864 to 116986050
[=================================================================] 100.00% 1 eta -

2596.707965 k-mers per position
58489081 DB matches per sequence
1 overflows
128 sequences passed prefiltering per query sequence
128 median result list length
0 sequences with 0 size result lists
Time for merging to pref_tmp_5: 0h 0m 0s 2ms
Time for merging to pref_tmp_5_tmp: 0h 0m 0s 1ms
Process prefiltering step 7 of 11

Index table k-mer threshold: 0 at k-mer size 7 
Index table: counting k-mers
[=================================================================] 100.00% 19.59M 12s 555ms    
Index table: Masked residues: 744285405
Index table: fill
[=================================================================] 100.00% 19.59M 21s 28ms        
Index statistics
Entries:          5044354030
DB size:          38629 MB
Avg k-mer size:   3.940902
Top 10 k-mers
    VLVLVVV     3263742
    SVSVVVV     3115363
    LVVVVVV     2852244
    VVVVVVV     2720900
    VSVVVVV     1945877
    SSVVVVV     1520299
    VVVVCVV     1388615
    SSVSVVV     1357513
    SNVVVVV     1290651
    DPSVVVV     993954
Time for index table init: 0h 0m 47s 787ms
k-mer similarity threshold: 135
Starting prefiltering scores calculation (step 7 of 11)
Query db start 1 to 1
Target db start 116986051 to 136573459
[=================================================================] 100.00% 1 eta -

2596.707965 k-mers per position
59114449 DB matches per sequence
1 overflows
128 sequences passed prefiltering per query sequence
128 median result list length
0 sequences with 0 size result lists
Time for merging to pref_tmp_6: 0h 0m 0s 2ms
Time for merging to pref_tmp_6_tmp: 0h 0m 0s 1ms
Process prefiltering step 8 of 11

Index table k-mer threshold: 0 at k-mer size 7 
Index table: counting k-mers
[=================================================================] 100.00% 19.41M 14s 471ms    
Index table: Masked residues: 768407198
Index table: fill
[=================================================================] 100.00% 19.41M 21s 377ms       
Index statistics
Entries:          5018897018
DB size:          38483 MB
Avg k-mer size:   3.921013
Top 10 k-mers
    VLVLVVV     3231672
    SVSVVVV     3084495
    LVVVVVV     2826156
    VVVVVVV     2701578
    VSVVVVV     1933911
    SSVVVVV     1501996
    VVVVCVV     1370173
    SSVSVVV     1347443
    SNVVVVV     1270275
    DPSVVVV     989075
Time for index table init: 0h 0m 51s 710ms
k-mer similarity threshold: 135
Starting prefiltering scores calculation (step 8 of 11)
Query db start 1 to 1
Target db start 136573460 to 155981499
[=================================================================] 100.00% 1 eta -

2596.707965 k-mers per position
58320674 DB matches per sequence
1 overflows
128 sequences passed prefiltering per query sequence
128 median result list length
0 sequences with 0 size result lists
Time for merging to pref_tmp_7: 0h 0m 0s 2ms
Time for merging to pref_tmp_7_tmp: 0h 0m 0s 2ms
Process prefiltering step 9 of 11

Index table k-mer threshold: 0 at k-mer size 7 
Index table: counting k-mers
[=================================================================] 100.00% 19.48M 13s 916ms    
Index table: Masked residues: 745901634
Index table: fill
[=================================================================] 100.00% 19.48M 20s 994ms       
Index statistics
Entries:          5044443419
DB size:          38630 MB
Avg k-mer size:   3.940971
Top 10 k-mers
    VLVLVVV     3254048
    SVSVVVV     3097754
    LVVVVVV     2837785
    VVVVVVV     2700264
    VSVVVVV     1929842
    SSVVVVV     1506275
    VVVVCVV     1372586
    SSVSVVV     1351086
    SNVVVVV     1288878
    DPSVVVV     989427
Time for index table init: 0h 0m 48s 419ms
k-mer similarity threshold: 135
Starting prefiltering scores calculation (step 9 of 11)
Query db start 1 to 1
Target db start 155981500 to 175456597
[=================================================================] 100.00% 1 eta -

2596.707965 k-mers per position
59053382 DB matches per sequence
1 overflows
128 sequences passed prefiltering per query sequence
128 median result list length
0 sequences with 0 size result lists
Time for merging to pref_tmp_8: 0h 0m 0s 2ms
Time for merging to pref_tmp_8_tmp: 0h 0m 0s 1ms
Process prefiltering step 10 of 11

Index table k-mer threshold: 0 at k-mer size 7 
Index table: counting k-mers
[=================================================================] 100.00% 19.56M 13s 475ms    
Index table: Masked residues: 723365664
Index table: fill
[=================================================================] 100.00% 19.56M 20s 933ms       
Index statistics
Entries:          5069810473
DB size:          38775 MB
Avg k-mer size:   3.960789
Top 10 k-mers
    VLVLVVV     3253276
    SVSVVVV     3113180
    LVVVVVV     2839470
    VVVVVVV     2704703
    VSVVVVV     1937111
    SSVVVVV     1520266
    VVVVCVV     1381843
    SSVSVVV     1362145
    SNVVVVV     1282927
    DPSVVVV     994258
Time for index table init: 0h 0m 47s 805ms
k-mer similarity threshold: 135
Starting prefiltering scores calculation (step 10 of 11)
Query db start 1 to 1
Target db start 175456598 to 195012177
[=================================================================] 100.00% 1 eta -

2596.707965 k-mers per position
59518254 DB matches per sequence
1 overflows
128 sequences passed prefiltering per query sequence
128 median result list length
0 sequences with 0 size result lists
Time for merging to pref_tmp_9: 0h 0m 0s 33ms
Time for merging to pref_tmp_9_tmp: 0h 0m 0s 1ms
Process prefiltering step 11 of 11

Index table k-mer threshold: 0 at k-mer size 7 
Index table: counting k-mers
[=================================================================] 100.00% 19.67M 13s 684ms    
Index table: Masked residues: 741626195
Index table: fill
[=================================================================] 100.00% 19.67M 17s 513ms    
Index statistics
Entries:          5045881259
DB size:          38638 MB
Avg k-mer size:   3.942095
Top 10 k-mers
    VLVLVVV     3265403
    SVSVVVV     3113183
    LVVVVVV     2857180
    VVVVVVV     2723157
    VSVVVVV     1940502
    SSVVVVV     1514450
    VVVVCVV     1390607
    SSVSVVV     1357116
    SNVVVVV     1287020
    DPSVVVV     996860
Time for index table init: 0h 0m 45s 149ms
k-mer similarity threshold: 135
Starting prefiltering scores calculation (step 11 of 11)
Query db start 1 to 1
Target db start 195012178 to 214683829
[=================================================================] 100.00% 1 eta -

2596.707965 k-mers per position
59152452 DB matches per sequence
1 overflows
128 sequences passed prefiltering per query sequence
128 median result list length
0 sequences with 0 size result lists
Time for merging to pref_tmp_10: 0h 0m 0s 2ms
Time for merging to pref_tmp_10_tmp: 0h 0m 0s 1ms
Merging 11 target splits to pref
Preparing offsets for merging: 0h 0m 0s 35ms
[=================================================================] 100.00% 1 eta -
Time for merging to pref: 0h 0m 0s 88ms
Time for merging target splits: 0h 0m 0s 274ms
Time for merging to pref_tmp: 0h 0m 0s 103ms
Time for processing: 0h 9m 29s 302ms
structurealign query/query_profile /datasets/UniprotKB/afdb ./tmpSearch/16735541491874290893/pref ./tmpSearch/16735541491874290893/strualn --tmscore-threshold 0 --tmscore-threshold-mode 0 --lddt-threshold 0 --sort-by-structure-bits 1 --alignment-type 0 --exact-tmscore 0 --sub-mat 'aa:3di.out,nucl:3di.out' -a 1 --alignment-mode 3 --alignment-output-mode 0 --wrapped-scoring 0 -e 10 --min-seq-id 0 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0 --cov-mode 0 --max-seq-len 65535 --comp-bias-corr 1 --comp-bias-corr-scale 0.5 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --score-bias 0 --realign 0 --realign-score-bias -0.2 --realign-max-seqs 2147483647 --corr-score-weight 0 --gap-open aa:10,nucl:10 --gap-extend aa:1,nucl:1 --zdrop 40 --threads 104 --compressed 0 -v 3 

Cannot find query_profile C-alpha or afdb C-alpha database
Disabling --sort-by-structure-bits
This impacts the final score and ranking of hits, but not E-values themselves. Ranking alterations primarily occur for E-values < 10^-1.
[=================================================================] 100.00% 1 eta -
./tmpSearch/16735541491874290893/structuresearch.sh: line 112: 2439915 Segmentation fault      (core dumped) $RUNNER "$MMSEQS" $ALIGNMENT_ALGO "${QUERY_ALIGNMENT}" "${TARGET_ALIGNMENT}${INDEXEXT}" "${TMP_PATH}/pref" "${TMP_PATH}/strualn" ${ALIGNMENT_PAR}
Error: Structure alignment step died

EvanKomp · 2025-05-07T14:49:13Z

@martin-steinegger @sooyoung-cha Please advise. Starting a structure search from a foldmason alignment would be a really cool result and useful for mapping out the structural space of enzymes with a particular activity.

martin-steinegger · 2025-05-07T16:08:04Z

It is really hard to judge from here. Could you upload the MSAs you used, please?

EvanKomp · 2025-05-08T18:13:23Z

@martin-steinegger You are amazing, truly. Here they are attached. They are small enough to view in text too:

amino acids

>504
------------------------------------------------------------QAQYQKGPDPTASALER-NGPFAIRSTSVSRTSVSGF-GGGRLYYPT--A-SGTYGAIAVSP--GFTGTSS-TMTFWGERLASHGFVVLVIDTITLYD----------QP-----------DSRARQLKAALDYLATQNGRSSSPIYRKVDTSRRAVAGHSMGGGGSLLAARDNP----SYKAAIPMAP-----------------------------------------W--------NTSSTA-------------------------------------FRTVSVPTMIFGCQDDSIAPVFSHAIPFYNAIPNSTRKNYVEIRNDDH-FCVMNGGGHDATLGKLGISWMKRFVDNDTRYSPFVCGAEYNRVVSSYEVSRSYNN-CPY--
>IsPETase
------------------------------------------------------------TNPYARGPNPTAASLEASAGPFTVRSFTVS--RPSGY-GAGTVYYPT--NAGGTVGAIAIVP--GYTARQS-SIKWWGPRLASHGFVVITIDTNSTLD----------QP-----------SSRSSQQMAALRQVASLNGTSSSPIYGKVDTARMGVMGWSMGGGGSLISAANNP----SLKAAAPQAP-----------------------------------------W--------DSSTN--------------------------------------FSSVTVPTLIFACENDSIAPVNSSALPIYDSMS-RNAKQFLEINGGSH-SCANSGNSNQALIGKKGVAWMKRFMDNDTRYSTFAC---EN-P-NSTRVSDFRTANCSLEH
>202_A
------------------------------------------------------------------VPVS------------PLQRVNF---YSAGYRLDGLLYTPRHLPAGERRPGVVLL--VGYTYLKTMVMPDIAKVLNAAGYVALV-FDYRGFGESEGPRGRLI------------PLEQVADARAALTFLAE----Q-----SMVDPDRLAVIGISLGGAHAITTAAL----DQRVRAVVALEPPG----HGARWLRSLRRH-WEWRQFLSRLAEDRRQRVLSGGSTMVDPLEIVLPDPESQAFLDQVAAEFPQMKVTLPLES-AEALIEYVSEDLAGRIAPRPLLIIHSDADQLVPVA-EAQAIAERAGSSAQLEII--PGMSHFNWVMPGSPGFTRVTDSIVKFLRNTLPVS---------------------------------
>202_B
---------------------------------------------------------------------S------------PLQRVNF---YSAGYRLDGLLYTPRHLPAGERRPGVVLL--VGYTYLKTMVMPDIAKVLNAAGYVALV-FDYRGFGESEGPRGRLI------------PLEQVADARAALTFLAE----Q-----SMVDPDRLAVIGISLGGAHAITTAAL----DQRVRAVVALEPPG----HGARWLRSLRRH-WEWRQFLSRLAEDRRQRVLSGGSTMVDPLEIVLPDPESQAFLDQVAAEFPQMKVTLPLES-AEALIEYVSEDLAGRIAPRPLLIIHSDADQLVPVA-EAQAIAERAGSSAQLEII--PGMSHFNWVMPGSPGFTRVTDSIVKFLRNTLPV----------------------------------
>611
---------------------------------------------------------------DVHGPDPTEESITAPRGPFEVDEESVSRLSVSG-FGGGTIYYP-TDTTDGLFSAVSISP--GFTGTQ-ETMAWYGPRLASQGFVVFTIDTITTT----------DQP-----------DSRARQLQASLDYLV---NDSD--VKDIIDPARLGVMGHSMGGGGSLKAALDNP----ALKAAIPLT-----------------------------------------PW--------HTTK--------------------------------------DFSGVQTPTLIIGAQNDTVAPVSQHAKPFYESLPDDPGKAYLELAGAS---HLAPNTD-NTTIAKFSIAWLKRFLDDDTRYDQFLCPPPE---NDD-SISDYQS-TCPYL-
>TfCut2
------------------------------------------------------------ANPYERGPNPTDALLEARSGPFSVSEENVSRLSASG-FGGGTIYYP-REN--NTYGAVAISP--GYTGTE-ASIAWLGERIASHGFVVITIDTITTL----------DQP-----------DSRAEQLNAALNHMI---NRASSTVRSRIDSSRLAVMGHSMGGGGSLRLASQRP----DLKAAIPLT-----------------------------------------PW--------HLNK--------------------------------------NWSSVTVPTLIIGADLDTIAPVATHAKPFYNSLPSSISKAYLELDGAT---HFAPNIP-NKIIGKYSVAWLKRFVDNDTRYTQFLCPGPR---GLFGEVEEYRS-TCPFYP
>LCC
------------------------------------------------------------SNPYQRGPNPTRSALTAD-GPFSVATYTVSRLSVSG-FGGGVIYYP-TGT-SLTFGGIAMSP--GYTADA-SSLAWLGRRLASHGFVVLVINTNSRF----------DYP-----------DSRASQLSAALNYLR---TSSPSAVRARLDANRLAVAGHSMGGGGTLRIAEQNP----SLKAAVPLT-----------------------------------------PW--------HTDK--------------------------------------TF-NTSVPVLIVGAEADTVAPVSQHAIPFYQNLPSTTPKVYVELDNAS---HFAPNSN-NAAISVYTISWMKLWVDNDTRYRQFLCN------VNDPALSDFRT-NNRHCQ
>307
----------------------------------------------------------------------------------------------------------------------------------------------------------------------QT------------VTSMLKDLDAVITQ---VSEKF-----PQIDNKRVCLIGHSQGAYVSFLHATK----DERIKCLVSWMGRL----S---DLKEFW-----SKLWFDEIE--------RKGY---------IYEW----------------DYKITKKY-VRDSLKYNLSKAAWR-IKVPTLLIYGELDDIVPPS-EGMKFYRNIKS--PKKIVIVKDLN---HTFSGEKAKKSVIRITLKWLSKWLKRLD--------------------------------
>102
QSPAQSSAPTVELDSGAIAGSTADGVVSFKGIPYAAPPVGNLRWRAPQPVASWTGVRAATEYGYDCIQLPLEGDAAASG------------GEMSEDCLVLNVWRPAEIAPGERLPVLVWIHGGGFLNGSAAAPIYDGTAFAQQGLVVVSFN------YRLGRLGFFAHPALTAANEGPLGNYGLMDQIAALEW--VQR----NIAAFGGDPARITLMGQSAGG-ISVMYHLTAPESQGLFHQAAVLSGGGRTYLLGLRNLRESTDALPSAEQ--SGLAFGRRFGIRGR-------------GRAALRSLRSLSA--EEVNGDLSMAALVEKPADY---AG---------------------------------------------------------------------------------------------------------------

3di

>504
------------------------------------------------------------DPPLDDADQDDLVCLLD-QAPFDKDKDWFPCVNFDLF-RTEMKMAGQ--D-DDAAAEEEEEE--AQLDALQ-QAVLVRRLLRSNGHTYGDTHGPDSHD----------AL-----------CSLLRSRVRVLVVLVVLCCDPPRSNPVHHPSQQYAYEYAARNLSNRLLNQLVCV----SYFEYERELY-----------------------------------------D--------DPDLQS-------------------------------------QLRRQHAYEYEAECAEPPSHCVNPVVSNQVSHDLNHWYKYWYFPPDYS-SCNTSVVVVSSLCSQQVSLRSCCGSSVDCSSQCCPANDVVVVSCPDPGTPDMDTS-DDD--
>IsPETase
------------------------------------------------------------DQQLDWDDQDDLVQQLDLAHQFDKDKDKDP--DQDLF-QTKMKIAGP--PRPAAAAEEEEEE--AQLDFQQ-FAVSLRRLQRRNPYIYMYGHGPDSHD----------AL-----------QSLLRSRVSVLVVVVVLCCPVPHPNPVRHDSQFYEYEYAASSLSNRLVNQLVPV----SHQEYEYEQY-----------------------------------------A--------HPDLQ--------------------------------------SQSHAHAYEYEYECPEPVNHCVRGVVSNQVNHD-NYWYKYWYFPNGYS-SCPTPPHPRSSLSSSQVNLRSCCGRSVDVSSLCSPA---DD-P-PDPRTPDIDTDPRDDDD
>202_A
------------------------------------------------------------------DDDD------------QWDWDWF---DFPNKIWIKIKGAQPPDDPQAAFAEEEEA--EDDLDAQVALPVLLSVLQSVLGYIYMY-IGQDQHDPIHHVHQADQ------------LVSSLSVSVRVLVVQCP----D-----SRHPQQRYAYEYEECGLLSSLVNLLV----DVSHQAYERELYQQ----ALQVLQCLLDDV-VVSVVLVVVLVVQVVVCVVPNAFDWAFSCVLQVDDPVVVVVVVVVCVVVVSSGGTYGSVN-SVSSNRRGSLVRLCSSPLRAYEYEYECQEPRRHCV-RVVSSDVSNPDSYYYHYD--YPDYSPPCSDSVDPNVVVVSVVVSVVCCVRRNRD---------------------------------
>202_B
---------------------------------------------------------------------D------------AWDFDWF---DFPRKIWTKIKGAAPPDDAQAAAAEEEEA--EDDLDAQVALPVLLQVLQSVLGYIYMY-IGQDQHDPIHHVHQADQ------------LVSSLSVSVRVLVVQCP----D-----SRHPQQRYAYEYEESGLLSSLVNLLV----DVSHLAYERELYFQ----ALQVLQCLLDDV-VLSVVLVVVLVVQVVCCVVPNAFDWAFSCVLQVDDPVVVVVVVVVCVVVVSSGGTYGSVN-SVSSNRRGSLVRNLSSPLRAYEYEYECQEPRRHCV-RVVSSDVSNDPRYHYDYD--YPDYRPPCSDSVDPNVVVVSVVVSVVCCVRRND----------------------------------
>611
---------------------------------------------------------------DADDDLDDLCQQLPLAHQFDKDKDWFQCVNFPL-FGTFIKIAG-PDQPGAAAAEEEEEE--AAPDAL-LLQVSCRRLQRRNPHTYTSTGGPHSH----------DAQ-----------CSLLSSRVRVLVCCC---PPDP--CVVRHDNQQYAYEYAASNLSSRLNNQLVPV----SHFEYEREL-----------------------------------------YD--------DPDL--------------------------------------ASCSRQHAYEYEHECPEPPSHCVNHVVSNQVNHDQPQKYKYWYFPPDY---SSPSRDD-DSVCSSVVNLRSSCGRVVNCSSLVPPPPDDD---PDP-GTPDMDM-SPDRD-
>TfCut2
------------------------------------------------------------DQPLDDDDQDDLVQQLDLAHQFDKDKDWFCCVQFDL-FHTWIKIAG-PDA--DAHEEEEEEE--AQQDAL-LLQVSLNNRLRRNGYTYTRTHGPDSH----------DAL-----------CSLLRSSVSVLCCQC---PPDDPVRNNRYPSQQYEYEYAASSLSSRLVNQLVCV----SYQEYEREL-----------------------------------------YH--------DPDL--------------------------------------ASCSGQHQYEYEHECAEPPSHCVNHVVSNQVNHDLPHWYKYWYFAPDY---SSPSSPS-ASVCRNVVNLSSCCGSSVHCSSLVCQVVHPQ---PPPPTTPGMDT-SPPGPD
>LCC
------------------------------------------------------------DFPLDDADLDDLVQLLDQ-ADFDWDKDWFCQVVDDL-FRTFMKIAG-PPD-QEAAEEEEEEE--AQPDAL-LLQVVVNRRQRSRQYTYTYTHGPDSH----------DAL-----------CSLLRSSVRVLVCQQ---PPDDPVSNSRHPSQFYAYEYAARRLSSQLANQLVDQ----SYQEYEYEC-----------------------------------------YY--------DQDL--------------------------------------AG-EHQHAYEYEHECPEPVNHCVRHVVSNQVRYDLNHWYKYFYFDNGY---SSQSRDR-DSVVSLVVNLSCCCGRRVDCSSVVPVPP------DDDPRTPDMDT-SCPSPD
>307
----------------------------------------------------------------------------------------------------------------------------------------------------------------------DA------------LVVVLVVVVVVLVC---CVVVP-----VVDDLQQDEDEAAACGLCSRLQNLLP----DNSHAEYEYELYQQ----A---APVVLD-----DPVLVVVCV--------VPQW---------DDDP----------------NDTDGPVR-NVRSVVDGRLVSLQS-NQHEYEYEYECAAPRRHCV-SVVSSQVRYDH--NYDYDYFYPQY---SVGPDPVSVVVVVVVVVVVSCVRRDDDD--------------------------------
>102
DPPPPDPFDWFAAPQGIEGFDDDDQKTKFAFAALFDQCAQVNQFHQTHGDHHDYDYHYRHDHDAAADADDDPPPLQDQP------------GHYDSRFLGKMKMAGPDDPVPAAFAEEEEDEAFPLRDGDLNRPVNDCVVVRVVGYIYMYGH------FHGWCRQAAAALAVVQVPRHDGGHNRLSSVLSVLVS--CLR----GVVSVRHDQQRYEQEYAACSL-LSLLCLVPDPSNPSSHDHYYRHHHDHPDDPLDAAAQCDADPVAHHSNQ--LQCVLLVVVVQNDN-------------HPVSVNVVSPDDS--CSSNPPDGSVCSVVNDNNT---HD---------------------------------------------------------------------------------------------------------------

fold_align_3di.txt
fold_align_aa.txt

martin-steinegger · 2025-05-10T15:00:03Z

Thank you for sharing your example. I've resolved an issue regarding profile searches with custom MSAs #452 and successfully performed a search using the following steps:

# download Swissprot
foldseek databases Alphafold/Swiss-Prot sprot tmp

# Rename input files to fasta (needed for reformat.pl)
mv fold_align_3di.txt fold_align_3di.fa
mv fold_align_aa.txt fold_align_aa.fa

# Convert FASTA alignments to Stockholm format
wget https://raw.githubusercontent.com/soedinglab/hh-suite/refs/heads/master/scripts/reformat.pl
chmod +x reformat.pl
./reformat.pl fold_align_3di.fa fold_align_3di.sto
./reformat.pl fold_align_aa.fa fold_align_aa.sto
foldseek convertmsa fold_align_aa.sto query_msa_aa
foldseek convertmsa fold_align_3di.sto query_msa_ss

# Generate profiles from MSAs
# Since the global MSA was produced by FoldMason, I recommend --match-mode 1
# This keeps every column with residues present in more than 50% of sequences.
foldseek msa2profile query_msa_ss query_profile_ss \
    --pca 1.4 --pcb 1.5 --comp-bias-corr 0 --match-mode 1

foldseek msa2profile query_msa_aa query_profile \
    --pca 1.1 --pcb 4.1 --comp-bias-corr 1 --comp-bias-corr-scale 1.0 \
    --match-mode 1 --sub-mat blosum62.out --seed-sub-mat blosum62.out

# Execute the Foldseek profile search
foldseek search query_profile sprot aln tmp

--match-mode
0: Only columns with residues in the first sequence are kept.
1: Columns with residues in at least --match-ratio of all sequences are retained.
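
To get a feel for how many columns of a given alignment would count as match columns under mode 1, the rule can be approximated outside Foldseek. This is only a sketch of the rule as worded in this thread, not Foldseek's implementation: `match_columns` is a hypothetical helper, it assumes aligned FASTA with `-` as the only gap character, and it uses "at least" for the threshold (the thread also says "more than 50%", so behaviour exactly at the boundary is an assumption).

```shell
#!/bin/sh
# Hypothetical helper (not a Foldseek command).
# Usage: match_columns ALIGNED_FASTA RATIO
# Counts columns in which the fraction of sequences carrying a residue
# (any character other than "-") is at least RATIO.
match_columns() {
    awk -v ratio="$2" '
        # Add the finished sequence to the per-column residue counts.
        function tally(   i) {
            if (seq == "") return
            n++
            if (length(seq) > width) width = length(seq)
            for (i = 1; i <= length(seq); i++)
                if (substr(seq, i, 1) != "-") filled[i]++
            seq = ""
        }
        /^>/ { tally(); next }
        { seq = seq $0 }
        END {
            tally()
            kept = 0
            for (i = 1; i <= width; i++)
                if (filled[i] >= ratio * n) kept++
            print kept " of " width " columns kept"
        }
    ' "$1"
}
```

Running `match_columns fold_align_aa.fa 0.5` and again with a different ratio shows how strongly the choice of --match-ratio would change the profile length for a gappy global alignment.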

Please let me know if this works for you.

martin-steinegger mentioned this issue Apr 17, 2025

Profile-seq search? #449

Closed

martin-steinegger added a commit that referenced this issue May 10, 2025

Fix msa2profile related issues #452

528c76a
