Convertmsa and msa2profile for 3di alignments? #452

Open
EvanKomp opened this issue Apr 16, 2025 · 6 comments

@EvanKomp

Hey all - I want to start a search with a seed alignment of known active homologs. I created the aa and 3di MSA with foldmason.

In mmseqs, I would create a profile db from the starting MSA and run a search on that; however, in foldseek I am not sure how. I tried foldseek convertmsa and then foldseek msa2profile on the 3di alignment, but when I try to run search it says it cannot find the _ss suffix. I assume this is because creating a database the way I did does not run the additional foldseek steps on top of mmseqs that createdb on a set of PDBs does.

Any help is much appreciated. I think a 3di alignment from such a workflow would be an excellent way to visualize the structural space of a certain type of protein with e.g. a VAE.

@martin-steinegger
Collaborator

martin-steinegger commented Apr 17, 2025

It is not super easy yet to convert an external MSA to a profile. @sooyoung-cha, did you try to get external MSAs into profiles?
Something that works out of the box would be to use internal MSAs:

foldseek search query target aln tmp -a 
foldseek result2profile query target aln prof 
foldseek search prof target aln tmp 

In order to use an external MSA you could try (I have not tested this though):

foldseek convertmsa inputAA.fa msa
foldseek convertmsa input3Di.fa msa_ss
foldseek msa2profile msa profile

Here the MSAs need to be formatted exactly the same way, meaning the same sequence order and the same gaps per sequence.
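
A minimal sketch of how that requirement could be checked before running convertmsa, assuming both alignments are plain aligned FASTA with `-` as the gap character. The function names `parse_fasta` and `compare_msas` are hypothetical helpers for illustration and are not part of Foldseek.

```python
def parse_fasta(text):
    """Return a list of (header, aligned_sequence) pairs from aligned FASTA text."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def compare_msas(aa_records, tdi_records):
    """List mismatches in order, length, or gap placement between two MSAs."""
    problems = []
    if len(aa_records) != len(tdi_records):
        problems.append(
            f"sequence count differs: {len(aa_records)} vs {len(tdi_records)}"
        )
    pairs = zip(aa_records, tdi_records)
    for i, ((aa_name, aa_seq), (tdi_name, tdi_seq)) in enumerate(pairs, start=1):
        if aa_name != tdi_name:
            problems.append(f"record {i}: header {aa_name!r} vs {tdi_name!r}")
        elif len(aa_seq) != len(tdi_seq):
            problems.append(f"record {i} ({aa_name}): aligned lengths differ")
        elif [c == "-" for c in aa_seq] != [c == "-" for c in tdi_seq]:
            problems.append(f"record {i} ({aa_name}): gap positions differ")
    return problems
```

For the file names used in the commands above, one could read `inputAA.fa` and `input3Di.fa`, pass their contents through `parse_fasta`, and proceed only if `compare_msas` returns an empty list.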

@EvanKomp
Author

EvanKomp commented Apr 23, 2025

@martin-steinegger Thanks for the reply.

The below got me to the point of a search that runs without error:

### convert alignment to msa database
foldseek convertmsa fold_align_aa.fa query/query_msa
foldseek convertmsa fold_align_3di.fa query/query_msa_ss

### convert msa database to profile database
foldseek msa2profile query/query_msa query/query_profile
foldseek msa2profile query/query_msa_ss query/query_profile_ss

### run search
foldseek search query/query_profile /datasets/UniprotKB/afdb ./search ./tmpSearch -c 0.6 --cov-mode 2 --threads 104 --alignment-type 0  --split-memory-limit 180G

However when unpacking the result:

### convert and unpack with result2msa and unpackdb
foldseek result2msa query/query_profile /datasets/UniprotKB/afdb search hit_msa --msa-format-mode 6
foldseek unpackdb hit_msa hit_msa_unpacked --unpack-suffix .a3m --unpack-name-mode 0

0.a3m is a big ol alignment but only contains the first sequence in the original msa. I assume this is user error, but it sort of seems like the search only used a single query sequence even though the query db was a profile one. Any ideas and thanks in advance?

EDIT: I hypothesized that maybe this was because of default parameters of msa2profile removing sequences from the MSA before computing the profile. I instead created both aa and ss profiles by foldseek msa2profile query/query_msa query/query_profile --match-mode 1 --match-ratio 0.8 --filter-msa 0, but now the search dies:

foldseek search query/query_profile /datasets/UniprotKB/afdb ./search ./tmpSearch --threads 104 --alignment-type 0  --split-memory-limit 180G -s 2 -a
Create directory ./tmpSearch
search query/query_profile /datasets/UniprotKB/afdb ./search ./tmpSearch --threads 104 --alignment-type 0 --split-memory-limit 180G -s 2 -a 

MMseqs Version:                         10.941cd33
Seq. id. threshold                      0
Coverage threshold                      0
Coverage mode                           0
Max reject                              2147483647
Max accept                              2147483647
Add backtrace                           true
TMscore threshold                       0
TMscore threshold mode                  0
TMalign hit order                       0
TMalign fast                            1
Preload mode                            0
Threads                                 104
Verbosity                               3
LDDT threshold                          0
Sort by structure bit score             1
Alignment type                          0
Exact TMscore                           0
Substitution matrix                     aa:3di.out,nucl:3di.out
Alignment mode                          3
Alignment mode                          0
E-value threshold                       10
Min alignment length                    0
Seq. id. mode                           0
Alternative alignments                  0
Max sequence length                     65535
Compositional bias                      1
Compositional bias                      1
Gap open cost                           aa:10,nucl:10
Gap extension cost                      aa:1,nucl:1
Compressed                              0
Seed substitution matrix                aa:3di.out,nucl:3di.out
Sensitivity                             2
k-mer length                            0
Target search mode                      0
k-score                                 seq:2147483647,prof:2147483647
Max results per query                   1000
Split database                          0
Split mode                              2
Split memory limit                      180G
Diagonal scoring                        true
Exact k-mer matching                    0
Mask residues                           0
Mask residues probability               0.999995
Mask lower case residues                1
Mask lower letter repeating N times     6
Minimum diagonal score                  30
Selected taxa                      
Spaced k-mers                           1
Spaced k-mer pattern               
Local temporary path               
Use GPU                                 0
Use GPU server                          0
Wait for GPU server                     600
Prefilter mode                          0
Exhaustive search mode                  false
Search iterations                       1
Remove temporary files                  true
MPI runner                         
Force restart with latest tmp           false
Cluster search                          0

prefilter query/query_profile_ss /datasets/UniprotKB/afdb_ss ./tmpSearch/16735541491874290893/pref --sub-mat 'aa:3di.out,nucl:3di.out' --seed-sub-mat 'aa:3di.out,nucl:3di.out' -s 2 -k 0 --target-search-mode 0 --k-score seq:2147483647,prof:2147483647 --alph-size aa:21,nucl:5 --max-seq-len 65535 --max-seqs 1000 --split 0 --split-mode 2 --split-memory-limit 180G -c 0 --cov-mode 0 --comp-bias-corr 1 --comp-bias-corr-scale 0.15 --diag-score 1 --exact-kmer-matching 0 --mask 0 --mask-prob 0.999995 --mask-lower-case 1 --mask-n-repeat 6 --min-ungapped-score 30 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --threads 104 --compressed 0 -v 3 

Query database size: 1 type: Profile
Target split mode. Searching through 11 splits
Estimated memory consumption: 155G
Target database size: 214683829 type: Aminoacid
Process prefiltering step 1 of 11

Index table k-mer threshold: 0 at k-mer size 7 
Index table: counting k-mers
[=================================================================] 100.00% 19.56M 12s 735ms    
Index table: Masked residues: 728815011
Index table: fill
[=================================================================] 100.00% 19.56M 20s 502ms       
Index statistics
Entries:          5063035714
DB size:          38736 MB
Avg k-mer size:   3.955497
Top 10 k-mers
    VLVLVVV     3259658
    SVSVVVV     3116512
    LVVVVVV     2850604
    VVVVVVV     2705558
    VSVVVVV     1942075
    SSVVVVV     1524001
    VVVVCVV     1378131
    SSVSVVV     1368735
    SNVVVVV     1288116
    DPSVVVV     994572
Time for index table init: 0h 0m 47s 250ms
k-mer similarity threshold: 135
Starting prefiltering scores calculation (step 1 of 11)
Query db start 1 to 1
Target db start 1 to 19555737
[=================================================================] 100.00% 1 eta -

2596.707965 k-mers per position
59422155 DB matches per sequence
1 overflows
128 sequences passed prefiltering per query sequence
128 median result list length
0 sequences with 0 size result lists
Time for merging to pref_tmp_0: 0h 0m 0s 2ms
Time for merging to pref_tmp_0_tmp: 0h 0m 0s 2ms
Process prefiltering step 2 of 11

Index table k-mer threshold: 0 at k-mer size 7 
Index table: counting k-mers
[=================================================================] 100.00% 19.40M 10s 571ms    
Index table: Masked residues: 765704783
Index table: fill
[=================================================================] 100.00% 19.40M 20s 279ms       
Index statistics
Entries:          5022494436
DB size:          38504 MB
Avg k-mer size:   3.923824
Top 10 k-mers
    VLVLVVV     3246333
    SVSVVVV     3097487
    LVVVVVV     2837946
    VVVVVVV     2713764
    VSVVVVV     1943484
    SSVVVVV     1511240
    VSVNVVV     1405579
    SSVSVVV     1355942
    SNVVVVV     1276593
    DPSVVVV     988214
Time for index table init: 0h 0m 46s 635ms
k-mer similarity threshold: 135
Starting prefiltering scores calculation (step 2 of 11)
Query db start 1 to 1
Target db start 19555738 to 38959624
[=================================================================] 100.00% 1 eta -

2596.707965 k-mers per position
58364017 DB matches per sequence
1 overflows
128 sequences passed prefiltering per query sequence
128 median result list length
0 sequences with 0 size result lists
Time for merging to pref_tmp_1: 0h 0m 0s 2ms
Time for merging to pref_tmp_1_tmp: 0h 0m 0s 1ms
Process prefiltering step 3 of 11

Index table k-mer threshold: 0 at k-mer size 7 
Index table: counting k-mers
[=================================================================] 100.00% 19.42M 10s 362ms    
Index table: Masked residues: 768096778
Index table: fill
[=================================================================] 100.00% 19.42M 17s 791ms    
Index statistics
Entries:          5019242848
DB size:          38485 MB
Avg k-mer size:   3.921283
Top 10 k-mers
    VLVLVVV     3239441
    SVSVVVV     3084889
    LVVVVVV     2832356
    VVVVVVV     2709768
    VSVVVVV     1934780
    DPVVVVV     1714739
    SSVVVVV     1507868
    SSVSVVV     1348844
    SNVVVVV     1275517
    DPSVVVV     983814
Time for index table init: 0h 0m 41s 96ms
k-mer similarity threshold: 135
Starting prefiltering scores calculation (step 3 of 11)
Query db start 1 to 1
Target db start 38959625 to 58376168
[=================================================================] 100.00% 1 eta -

2596.707965 k-mers per position
58258151 DB matches per sequence
1 overflows
128 sequences passed prefiltering per query sequence
128 median result list length
0 sequences with 0 size result lists
Time for merging to pref_tmp_2: 0h 0m 0s 2ms
Time for merging to pref_tmp_2_tmp: 0h 0m 0s 2ms
Process prefiltering step 4 of 11

Index table k-mer threshold: 0 at k-mer size 7 
Index table: counting k-mers
[=================================================================] 100.00% 19.50M 10s 228ms    
Index table: Masked residues: 745687563
Index table: fill
[=================================================================] 100.00% 19.50M 21s 287ms       
Index statistics
Entries:          5044971177
DB size:          38633 MB
Avg k-mer size:   3.941384
Top 10 k-mers
    VLVLVVV     3253205
    SVSVVVV     3106731
    LVVVVVV     2843608
    VVVVVVV     2705200
    VSVVVVV     1939666
    SSVVVVV     1517270
    VVVVCVV     1380649
    SSVSVVV     1361537
    SNVVVVV     1284169
    DPSVVVV     991381
Time for index table init: 0h 0m 46s 137ms
k-mer similarity threshold: 135
Starting prefiltering scores calculation (step 4 of 11)
Query db start 1 to 1
Target db start 58376169 to 77880679
[=================================================================] 100.00% 1 eta -

2596.707965 k-mers per position
59061544 DB matches per sequence
1 overflows
128 sequences passed prefiltering per query sequence
128 median result list length
0 sequences with 0 size result lists
Time for merging to pref_tmp_3: 0h 0m 0s 2ms
Time for merging to pref_tmp_3_tmp: 0h 0m 0s 1ms
Process prefiltering step 5 of 11

Index table k-mer threshold: 0 at k-mer size 7 
Index table: counting k-mers
[=================================================================] 100.00% 19.60M 9s 917ms     
Index table: Masked residues: 742844995
Index table: fill
[=================================================================] 100.00% 19.60M 17s 779ms    
Index statistics
Entries:          5046317092
DB size:          38640 MB
Avg k-mer size:   3.942435
Top 10 k-mers
    VLVLVVV     3269420
    SVSVVVV     3111766
    LVVVVVV     2855704
    VVVVVVV     2722385
    VSVVVVV     1948106
    SSVVVVV     1518315
    VVVVCVV     1382643
    SSVSVVV     1357458
    SNVVVVV     1288378
    DPSVVVV     992054
Time for index table init: 0h 0m 41s 557ms
k-mer similarity threshold: 135
Starting prefiltering scores calculation (step 5 of 11)
Query db start 1 to 1
Target db start 77880680 to 97476863
[=================================================================] 100.00% 1 eta -

2596.707965 k-mers per position
59018457 DB matches per sequence
1 overflows
128 sequences passed prefiltering per query sequence
128 median result list length
0 sequences with 0 size result lists
Time for merging to pref_tmp_4: 0h 0m 0s 2ms
Time for merging to pref_tmp_4_tmp: 0h 0m 0s 1ms
Process prefiltering step 6 of 11

Index table k-mer threshold: 0 at k-mer size 7 
Index table: counting k-mers
[=================================================================] 100.00% 19.51M 10s 614ms   
Index table: Masked residues: 757250387
Index table: fill
[=================================================================] 100.00% 19.51M 18s 75ms     
Index statistics
Entries:          5030573380
DB size:          38550 MB
Avg k-mer size:   3.930135
Top 10 k-mers
    VLVLVVV     3260308
    SVSVVVV     3096683
    LVVVVVV     2853667
    VVVVVVV     2717024
    VSVVVVV     1933661
    SSVVVVV     1512613
    VVVVCVV     1377403
    SSVSVVV     1353638
    SNVVVVV     1280451
    DPSVVVV     993383
Time for index table init: 0h 0m 41s 391ms
k-mer similarity threshold: 135
Starting prefiltering scores calculation (step 6 of 11)
Query db start 1 to 1
Target db start 97476864 to 116986050
[=================================================================] 100.00% 1 eta -

2596.707965 k-mers per position
58489081 DB matches per sequence
1 overflows
128 sequences passed prefiltering per query sequence
128 median result list length
0 sequences with 0 size result lists
Time for merging to pref_tmp_5: 0h 0m 0s 2ms
Time for merging to pref_tmp_5_tmp: 0h 0m 0s 1ms
Process prefiltering step 7 of 11

Index table k-mer threshold: 0 at k-mer size 7 
Index table: counting k-mers
[=================================================================] 100.00% 19.59M 12s 555ms    
Index table: Masked residues: 744285405
Index table: fill
[=================================================================] 100.00% 19.59M 21s 28ms        
Index statistics
Entries:          5044354030
DB size:          38629 MB
Avg k-mer size:   3.940902
Top 10 k-mers
    VLVLVVV     3263742
    SVSVVVV     3115363
    LVVVVVV     2852244
    VVVVVVV     2720900
    VSVVVVV     1945877
    SSVVVVV     1520299
    VVVVCVV     1388615
    SSVSVVV     1357513
    SNVVVVV     1290651
    DPSVVVV     993954
Time for index table init: 0h 0m 47s 787ms
k-mer similarity threshold: 135
Starting prefiltering scores calculation (step 7 of 11)
Query db start 1 to 1
Target db start 116986051 to 136573459
[=================================================================] 100.00% 1 eta -

2596.707965 k-mers per position
59114449 DB matches per sequence
1 overflows
128 sequences passed prefiltering per query sequence
128 median result list length
0 sequences with 0 size result lists
Time for merging to pref_tmp_6: 0h 0m 0s 2ms
Time for merging to pref_tmp_6_tmp: 0h 0m 0s 1ms
Process prefiltering step 8 of 11

Index table k-mer threshold: 0 at k-mer size 7 
Index table: counting k-mers
[=================================================================] 100.00% 19.41M 14s 471ms    
Index table: Masked residues: 768407198
Index table: fill
[=================================================================] 100.00% 19.41M 21s 377ms       
Index statistics
Entries:          5018897018
DB size:          38483 MB
Avg k-mer size:   3.921013
Top 10 k-mers
    VLVLVVV     3231672
    SVSVVVV     3084495
    LVVVVVV     2826156
    VVVVVVV     2701578
    VSVVVVV     1933911
    SSVVVVV     1501996
    VVVVCVV     1370173
    SSVSVVV     1347443
    SNVVVVV     1270275
    DPSVVVV     989075
Time for index table init: 0h 0m 51s 710ms
k-mer similarity threshold: 135
Starting prefiltering scores calculation (step 8 of 11)
Query db start 1 to 1
Target db start 136573460 to 155981499
[=================================================================] 100.00% 1 eta -

2596.707965 k-mers per position
58320674 DB matches per sequence
1 overflows
128 sequences passed prefiltering per query sequence
128 median result list length
0 sequences with 0 size result lists
Time for merging to pref_tmp_7: 0h 0m 0s 2ms
Time for merging to pref_tmp_7_tmp: 0h 0m 0s 2ms
Process prefiltering step 9 of 11

Index table k-mer threshold: 0 at k-mer size 7 
Index table: counting k-mers
[=================================================================] 100.00% 19.48M 13s 916ms    
Index table: Masked residues: 745901634
Index table: fill
[=================================================================] 100.00% 19.48M 20s 994ms       
Index statistics
Entries:          5044443419
DB size:          38630 MB
Avg k-mer size:   3.940971
Top 10 k-mers
    VLVLVVV     3254048
    SVSVVVV     3097754
    LVVVVVV     2837785
    VVVVVVV     2700264
    VSVVVVV     1929842
    SSVVVVV     1506275
    VVVVCVV     1372586
    SSVSVVV     1351086
    SNVVVVV     1288878
    DPSVVVV     989427
Time for index table init: 0h 0m 48s 419ms
k-mer similarity threshold: 135
Starting prefiltering scores calculation (step 9 of 11)
Query db start 1 to 1
Target db start 155981500 to 175456597
[=================================================================] 100.00% 1 eta -

2596.707965 k-mers per position
59053382 DB matches per sequence
1 overflows
128 sequences passed prefiltering per query sequence
128 median result list length
0 sequences with 0 size result lists
Time for merging to pref_tmp_8: 0h 0m 0s 2ms
Time for merging to pref_tmp_8_tmp: 0h 0m 0s 1ms
Process prefiltering step 10 of 11

Index table k-mer threshold: 0 at k-mer size 7 
Index table: counting k-mers
[=================================================================] 100.00% 19.56M 13s 475ms    
Index table: Masked residues: 723365664
Index table: fill
[=================================================================] 100.00% 19.56M 20s 933ms       
Index statistics
Entries:          5069810473
DB size:          38775 MB
Avg k-mer size:   3.960789
Top 10 k-mers
    VLVLVVV     3253276
    SVSVVVV     3113180
    LVVVVVV     2839470
    VVVVVVV     2704703
    VSVVVVV     1937111
    SSVVVVV     1520266
    VVVVCVV     1381843
    SSVSVVV     1362145
    SNVVVVV     1282927
    DPSVVVV     994258
Time for index table init: 0h 0m 47s 805ms
k-mer similarity threshold: 135
Starting prefiltering scores calculation (step 10 of 11)
Query db start 1 to 1
Target db start 175456598 to 195012177
[=================================================================] 100.00% 1 eta -

2596.707965 k-mers per position
59518254 DB matches per sequence
1 overflows
128 sequences passed prefiltering per query sequence
128 median result list length
0 sequences with 0 size result lists
Time for merging to pref_tmp_9: 0h 0m 0s 33ms
Time for merging to pref_tmp_9_tmp: 0h 0m 0s 1ms
Process prefiltering step 11 of 11

Index table k-mer threshold: 0 at k-mer size 7 
Index table: counting k-mers
[=================================================================] 100.00% 19.67M 13s 684ms    
Index table: Masked residues: 741626195
Index table: fill
[=================================================================] 100.00% 19.67M 17s 513ms    
Index statistics
Entries:          5045881259
DB size:          38638 MB
Avg k-mer size:   3.942095
Top 10 k-mers
    VLVLVVV     3265403
    SVSVVVV     3113183
    LVVVVVV     2857180
    VVVVVVV     2723157
    VSVVVVV     1940502
    SSVVVVV     1514450
    VVVVCVV     1390607
    SSVSVVV     1357116
    SNVVVVV     1287020
    DPSVVVV     996860
Time for index table init: 0h 0m 45s 149ms
k-mer similarity threshold: 135
Starting prefiltering scores calculation (step 11 of 11)
Query db start 1 to 1
Target db start 195012178 to 214683829
[=================================================================] 100.00% 1 eta -

2596.707965 k-mers per position
59152452 DB matches per sequence
1 overflows
128 sequences passed prefiltering per query sequence
128 median result list length
0 sequences with 0 size result lists
Time for merging to pref_tmp_10: 0h 0m 0s 2ms
Time for merging to pref_tmp_10_tmp: 0h 0m 0s 1ms
Merging 11 target splits to pref
Preparing offsets for merging: 0h 0m 0s 35ms
[=================================================================] 100.00% 1 eta -
Time for merging to pref: 0h 0m 0s 88ms
Time for merging target splits: 0h 0m 0s 274ms
Time for merging to pref_tmp: 0h 0m 0s 103ms
Time for processing: 0h 9m 29s 302ms
structurealign query/query_profile /datasets/UniprotKB/afdb ./tmpSearch/16735541491874290893/pref ./tmpSearch/16735541491874290893/strualn --tmscore-threshold 0 --tmscore-threshold-mode 0 --lddt-threshold 0 --sort-by-structure-bits 1 --alignment-type 0 --exact-tmscore 0 --sub-mat 'aa:3di.out,nucl:3di.out' -a 1 --alignment-mode 3 --alignment-output-mode 0 --wrapped-scoring 0 -e 10 --min-seq-id 0 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0 --cov-mode 0 --max-seq-len 65535 --comp-bias-corr 1 --comp-bias-corr-scale 0.5 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --score-bias 0 --realign 0 --realign-score-bias -0.2 --realign-max-seqs 2147483647 --corr-score-weight 0 --gap-open aa:10,nucl:10 --gap-extend aa:1,nucl:1 --zdrop 40 --threads 104 --compressed 0 -v 3 

Cannot find query_profile C-alpha or afdb C-alpha database
Disabling --sort-by-structure-bits
This impacts the final score and ranking of hits, but not E-values themselves. Ranking alterations primarily occur for E-values < 10^-1.
[=================================================================] 100.00% 1 eta -
./tmpSearch/16735541491874290893/structuresearch.sh: line 112: 2439915 Segmentation fault      (core dumped) $RUNNER "$MMSEQS" $ALIGNMENT_ALGO "${QUERY_ALIGNMENT}" "${TARGET_ALIGNMENT}${INDEXEXT}" "${TMP_PATH}/pref" "${TMP_PATH}/strualn" ${ALIGNMENT_PAR}
Error: Structure alignment step died

@EvanKomp
Author

EvanKomp commented May 7, 2025

@martin-steinegger @sooyoung-cha Please advise. Starting a structure search from a foldmason alignment would be a really cool result and useful for mapping out the structural space of enzymes with a particular activity.

@martin-steinegger
Collaborator

martin-steinegger commented May 7, 2025

It is really hard to judge from here. Could you upload the MSAs you used, please?

@EvanKomp
Author

EvanKomp commented May 8, 2025

@martin-steinegger You are amazing, truly. Here they are attached. They are small enough to view as text too:

amino acids

>504
------------------------------------------------------------QAQYQKGPDPTASALER-NGPFAIRSTSVSRTSVSGF-GGGRLYYPT--A-SGTYGAIAVSP--GFTGTSS-TMTFWGERLASHGFVVLVIDTITLYD----------QP-----------DSRARQLKAALDYLATQNGRSSSPIYRKVDTSRRAVAGHSMGGGGSLLAARDNP----SYKAAIPMAP-----------------------------------------W--------NTSSTA-------------------------------------FRTVSVPTMIFGCQDDSIAPVFSHAIPFYNAIPNSTRKNYVEIRNDDH-FCVMNGGGHDATLGKLGISWMKRFVDNDTRYSPFVCGAEYNRVVSSYEVSRSYNN-CPY--
>IsPETase
------------------------------------------------------------TNPYARGPNPTAASLEASAGPFTVRSFTVS--RPSGY-GAGTVYYPT--NAGGTVGAIAIVP--GYTARQS-SIKWWGPRLASHGFVVITIDTNSTLD----------QP-----------SSRSSQQMAALRQVASLNGTSSSPIYGKVDTARMGVMGWSMGGGGSLISAANNP----SLKAAAPQAP-----------------------------------------W--------DSSTN--------------------------------------FSSVTVPTLIFACENDSIAPVNSSALPIYDSMS-RNAKQFLEINGGSH-SCANSGNSNQALIGKKGVAWMKRFMDNDTRYSTFAC---EN-P-NSTRVSDFRTANCSLEH
>202_A
------------------------------------------------------------------VPVS------------PLQRVNF---YSAGYRLDGLLYTPRHLPAGERRPGVVLL--VGYTYLKTMVMPDIAKVLNAAGYVALV-FDYRGFGESEGPRGRLI------------PLEQVADARAALTFLAE----Q-----SMVDPDRLAVIGISLGGAHAITTAAL----DQRVRAVVALEPPG----HGARWLRSLRRH-WEWRQFLSRLAEDRRQRVLSGGSTMVDPLEIVLPDPESQAFLDQVAAEFPQMKVTLPLES-AEALIEYVSEDLAGRIAPRPLLIIHSDADQLVPVA-EAQAIAERAGSSAQLEII--PGMSHFNWVMPGSPGFTRVTDSIVKFLRNTLPVS---------------------------------
>202_B
---------------------------------------------------------------------S------------PLQRVNF---YSAGYRLDGLLYTPRHLPAGERRPGVVLL--VGYTYLKTMVMPDIAKVLNAAGYVALV-FDYRGFGESEGPRGRLI------------PLEQVADARAALTFLAE----Q-----SMVDPDRLAVIGISLGGAHAITTAAL----DQRVRAVVALEPPG----HGARWLRSLRRH-WEWRQFLSRLAEDRRQRVLSGGSTMVDPLEIVLPDPESQAFLDQVAAEFPQMKVTLPLES-AEALIEYVSEDLAGRIAPRPLLIIHSDADQLVPVA-EAQAIAERAGSSAQLEII--PGMSHFNWVMPGSPGFTRVTDSIVKFLRNTLPV----------------------------------
>611
---------------------------------------------------------------DVHGPDPTEESITAPRGPFEVDEESVSRLSVSG-FGGGTIYYP-TDTTDGLFSAVSISP--GFTGTQ-ETMAWYGPRLASQGFVVFTIDTITTT----------DQP-----------DSRARQLQASLDYLV---NDSD--VKDIIDPARLGVMGHSMGGGGSLKAALDNP----ALKAAIPLT-----------------------------------------PW--------HTTK--------------------------------------DFSGVQTPTLIIGAQNDTVAPVSQHAKPFYESLPDDPGKAYLELAGAS---HLAPNTD-NTTIAKFSIAWLKRFLDDDTRYDQFLCPPPE---NDD-SISDYQS-TCPYL-
>TfCut2
------------------------------------------------------------ANPYERGPNPTDALLEARSGPFSVSEENVSRLSASG-FGGGTIYYP-REN--NTYGAVAISP--GYTGTE-ASIAWLGERIASHGFVVITIDTITTL----------DQP-----------DSRAEQLNAALNHMI---NRASSTVRSRIDSSRLAVMGHSMGGGGSLRLASQRP----DLKAAIPLT-----------------------------------------PW--------HLNK--------------------------------------NWSSVTVPTLIIGADLDTIAPVATHAKPFYNSLPSSISKAYLELDGAT---HFAPNIP-NKIIGKYSVAWLKRFVDNDTRYTQFLCPGPR---GLFGEVEEYRS-TCPFYP
>LCC
------------------------------------------------------------SNPYQRGPNPTRSALTAD-GPFSVATYTVSRLSVSG-FGGGVIYYP-TGT-SLTFGGIAMSP--GYTADA-SSLAWLGRRLASHGFVVLVINTNSRF----------DYP-----------DSRASQLSAALNYLR---TSSPSAVRARLDANRLAVAGHSMGGGGTLRIAEQNP----SLKAAVPLT-----------------------------------------PW--------HTDK--------------------------------------TF-NTSVPVLIVGAEADTVAPVSQHAIPFYQNLPSTTPKVYVELDNAS---HFAPNSN-NAAISVYTISWMKLWVDNDTRYRQFLCN------VNDPALSDFRT-NNRHCQ
>307
----------------------------------------------------------------------------------------------------------------------------------------------------------------------QT------------VTSMLKDLDAVITQ---VSEKF-----PQIDNKRVCLIGHSQGAYVSFLHATK----DERIKCLVSWMGRL----S---DLKEFW-----SKLWFDEIE--------RKGY---------IYEW----------------DYKITKKY-VRDSLKYNLSKAAWR-IKVPTLLIYGELDDIVPPS-EGMKFYRNIKS--PKKIVIVKDLN---HTFSGEKAKKSVIRITLKWLSKWLKRLD--------------------------------
>102
QSPAQSSAPTVELDSGAIAGSTADGVVSFKGIPYAAPPVGNLRWRAPQPVASWTGVRAATEYGYDCIQLPLEGDAAASG------------GEMSEDCLVLNVWRPAEIAPGERLPVLVWIHGGGFLNGSAAAPIYDGTAFAQQGLVVVSFN------YRLGRLGFFAHPALTAANEGPLGNYGLMDQIAALEW--VQR----NIAAFGGDPARITLMGQSAGG-ISVMYHLTAPESQGLFHQAAVLSGGGRTYLLGLRNLRESTDALPSAEQ--SGLAFGRRFGIRGR-------------GRAALRSLRSLSA--EEVNGDLSMAALVEKPADY---AG---------------------------------------------------------------------------------------------------------------

3di

>504
------------------------------------------------------------DPPLDDADQDDLVCLLD-QAPFDKDKDWFPCVNFDLF-RTEMKMAGQ--D-DDAAAEEEEEE--AQLDALQ-QAVLVRRLLRSNGHTYGDTHGPDSHD----------AL-----------CSLLRSRVRVLVVLVVLCCDPPRSNPVHHPSQQYAYEYAARNLSNRLLNQLVCV----SYFEYERELY-----------------------------------------D--------DPDLQS-------------------------------------QLRRQHAYEYEAECAEPPSHCVNPVVSNQVSHDLNHWYKYWYFPPDYS-SCNTSVVVVSSLCSQQVSLRSCCGSSVDCSSQCCPANDVVVVSCPDPGTPDMDTS-DDD--
>IsPETase
------------------------------------------------------------DQQLDWDDQDDLVQQLDLAHQFDKDKDKDP--DQDLF-QTKMKIAGP--PRPAAAAEEEEEE--AQLDFQQ-FAVSLRRLQRRNPYIYMYGHGPDSHD----------AL-----------QSLLRSRVSVLVVVVVLCCPVPHPNPVRHDSQFYEYEYAASSLSNRLVNQLVPV----SHQEYEYEQY-----------------------------------------A--------HPDLQ--------------------------------------SQSHAHAYEYEYECPEPVNHCVRGVVSNQVNHD-NYWYKYWYFPNGYS-SCPTPPHPRSSLSSSQVNLRSCCGRSVDVSSLCSPA---DD-P-PDPRTPDIDTDPRDDDD
>202_A
------------------------------------------------------------------DDDD------------QWDWDWF---DFPNKIWIKIKGAQPPDDPQAAFAEEEEA--EDDLDAQVALPVLLSVLQSVLGYIYMY-IGQDQHDPIHHVHQADQ------------LVSSLSVSVRVLVVQCP----D-----SRHPQQRYAYEYEECGLLSSLVNLLV----DVSHQAYERELYQQ----ALQVLQCLLDDV-VVSVVLVVVLVVQVVVCVVPNAFDWAFSCVLQVDDPVVVVVVVVVCVVVVSSGGTYGSVN-SVSSNRRGSLVRLCSSPLRAYEYEYECQEPRRHCV-RVVSSDVSNPDSYYYHYD--YPDYSPPCSDSVDPNVVVVSVVVSVVCCVRRNRD---------------------------------
>202_B
---------------------------------------------------------------------D------------AWDFDWF---DFPRKIWTKIKGAAPPDDAQAAAAEEEEA--EDDLDAQVALPVLLQVLQSVLGYIYMY-IGQDQHDPIHHVHQADQ------------LVSSLSVSVRVLVVQCP----D-----SRHPQQRYAYEYEESGLLSSLVNLLV----DVSHLAYERELYFQ----ALQVLQCLLDDV-VLSVVLVVVLVVQVVCCVVPNAFDWAFSCVLQVDDPVVVVVVVVVCVVVVSSGGTYGSVN-SVSSNRRGSLVRNLSSPLRAYEYEYECQEPRRHCV-RVVSSDVSNDPRYHYDYD--YPDYRPPCSDSVDPNVVVVSVVVSVVCCVRRND----------------------------------
>611
---------------------------------------------------------------DADDDLDDLCQQLPLAHQFDKDKDWFQCVNFPL-FGTFIKIAG-PDQPGAAAAEEEEEE--AAPDAL-LLQVSCRRLQRRNPHTYTSTGGPHSH----------DAQ-----------CSLLSSRVRVLVCCC---PPDP--CVVRHDNQQYAYEYAASNLSSRLNNQLVPV----SHFEYEREL-----------------------------------------YD--------DPDL--------------------------------------ASCSRQHAYEYEHECPEPPSHCVNHVVSNQVNHDQPQKYKYWYFPPDY---SSPSRDD-DSVCSSVVNLRSSCGRVVNCSSLVPPPPDDD---PDP-GTPDMDM-SPDRD-
>TfCut2
------------------------------------------------------------DQPLDDDDQDDLVQQLDLAHQFDKDKDWFCCVQFDL-FHTWIKIAG-PDA--DAHEEEEEEE--AQQDAL-LLQVSLNNRLRRNGYTYTRTHGPDSH----------DAL-----------CSLLRSSVSVLCCQC---PPDDPVRNNRYPSQQYEYEYAASSLSSRLVNQLVCV----SYQEYEREL-----------------------------------------YH--------DPDL--------------------------------------ASCSGQHQYEYEHECAEPPSHCVNHVVSNQVNHDLPHWYKYWYFAPDY---SSPSSPS-ASVCRNVVNLSSCCGSSVHCSSLVCQVVHPQ---PPPPTTPGMDT-SPPGPD
>LCC
------------------------------------------------------------DFPLDDADLDDLVQLLDQ-ADFDWDKDWFCQVVDDL-FRTFMKIAG-PPD-QEAAEEEEEEE--AQPDAL-LLQVVVNRRQRSRQYTYTYTHGPDSH----------DAL-----------CSLLRSSVRVLVCQQ---PPDDPVSNSRHPSQFYAYEYAARRLSSQLANQLVDQ----SYQEYEYEC-----------------------------------------YY--------DQDL--------------------------------------AG-EHQHAYEYEHECPEPVNHCVRHVVSNQVRYDLNHWYKYFYFDNGY---SSQSRDR-DSVVSLVVNLSCCCGRRVDCSSVVPVPP------DDDPRTPDMDT-SCPSPD
>307
----------------------------------------------------------------------------------------------------------------------------------------------------------------------DA------------LVVVLVVVVVVLVC---CVVVP-----VVDDLQQDEDEAAACGLCSRLQNLLP----DNSHAEYEYELYQQ----A---APVVLD-----DPVLVVVCV--------VPQW---------DDDP----------------NDTDGPVR-NVRSVVDGRLVSLQS-NQHEYEYEYECAAPRRHCV-SVVSSQVRYDH--NYDYDYFYPQY---SVGPDPVSVVVVVVVVVVVSCVRRDDDD--------------------------------
>102
DPPPPDPFDWFAAPQGIEGFDDDDQKTKFAFAALFDQCAQVNQFHQTHGDHHDYDYHYRHDHDAAADADDDPPPLQDQP------------GHYDSRFLGKMKMAGPDDPVPAAFAEEEEDEAFPLRDGDLNRPVNDCVVVRVVGYIYMYGH------FHGWCRQAAAALAVVQVPRHDGGHNRLSSVLSVLVS--CLR----GVVSVRHDQQRYEQEYAACSL-LSLLCLVPDPSNPSSHDHYYRHHHDHPDDPLDAAAQCDADPVAHHSNQ--LQCVLLVVVVQNDN-------------HPVSVNVVSPDDS--CSSNPPDGSVCSVVNDNNT---HD---------------------------------------------------------------------------------------------------------------

fold_align_3di.txt
fold_align_aa.txt

@martin-steinegger
Collaborator

Thank you for sharing your example. I've resolved an issue regarding profile searches with custom MSAs #452 and successfully performed a search using the following steps:

# download Swissprot
foldseek databases Alphafold/Swiss-Prot sprot tmp

# Rename input files to fasta (needed for reformat.pl)
mv fold_align_3di.txt fold_align_3di.fa
mv fold_align_aa.txt fold_align_aa.fa

# Convert FASTA alignments to Stockholm format
wget https://raw.githubusercontent.com/soedinglab/hh-suite/refs/heads/master/scripts/reformat.pl
chmod +x reformat.pl
./reformat.pl fold_align_3di.fa fold_align_3di.sto
./reformat.pl fold_align_aa.fa fold_align_aa.sto
foldseek convertmsa fold_align_aa.sto query_msa_aa
foldseek convertmsa fold_align_3di.sto query_msa_ss

# Generate profiles from MSAs
# Since the global MSA was produced by FoldMason, I recommend --match-mode 1
# This keeps every column with residues present in more than 50% of sequences.
foldseek msa2profile query_msa_ss query_profile_ss \
    --pca 1.4 --pcb 1.5 --comp-bias-corr 0 --match-mode 1

foldseek msa2profile query_msa_aa query_profile \
    --pca 1.1 --pcb 4.1 --comp-bias-corr 1 --comp-bias-corr-scale 1.0 \
    --match-mode 1 --sub-mat blosum62.out --seed-sub-mat blosum62.out

# Execute the Foldseek profile search
foldseek search query_profile sprot aln tmp

--match-mode
0: Only columns with residues in the first sequence are kept.
1: Columns with residues in at least --match-ratio of all sequences are retained.
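
To illustrate the difference between the two modes, here is a small sketch of the column-selection rule as described above. It is not Foldseek's implementation: the function name `match_columns` is hypothetical, the default ratio of 0.5 is taken from the "50%" comment in the commands above, and whether the threshold comparison is strict at the boundary is an assumption.

```python
def match_columns(msa, match_mode, match_ratio=0.5):
    """Return indices of alignment columns kept as match columns.

    msa is a list of equal-length aligned sequences with '-' as the gap.
    """
    if not msa:
        return []
    n_cols = len(msa[0])
    if match_mode == 0:
        # Mode 0: keep columns where the first sequence has a residue.
        return [i for i in range(n_cols) if msa[0][i] != "-"]
    if match_mode == 1:
        # Mode 1: keep columns where enough sequences have a residue.
        # Assumption: "at least match_ratio" is read as >=.
        kept = []
        for i in range(n_cols):
            residues = sum(1 for seq in msa if seq[i] != "-")
            if residues / len(msa) >= match_ratio:
                kept.append(i)
        return kept
    raise ValueError("match_mode must be 0 or 1")


toy_msa = ["--AC", "G-AC", "GTA-"]
print(match_columns(toy_msa, 0))  # columns where the first sequence has a residue
print(match_columns(toy_msa, 1))  # columns where most sequences have a residue
```

In the alignment posted above, the first sequence (504) begins with a long run of gaps, so under mode 0 those leading columns would be dropped regardless of what the other sequences contain there.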

Please let me know if this works for you.
