Issue with Gene Pair Discrepancy #773

ChuanzhengWei · 2025-04-22T16:15:28Z

I am using JCVI, which is a very powerful tool, but I have encountered some issues. I am analyzing four varieties (A, B, C, D) of the same species using the same code for gene pair analysis. However, the results show significant differences in the number of gene pairs, as follows:

A vs B: 26,733 gene pairs
A vs C: 26,644 gene pairs
A vs D: 26,663 gene pairs
B vs C: 15,791 gene pairs
B vs D: 15,721 gene pairs

For example, A#gene1 and B#gene1 are identified as a gene pair, A#gene1 and C#gene1 are also identified as a gene pair, but B#gene1 and C#gene1 are not identified as a gene pair.

I would like to understand why this discrepancy is occurring and how to resolve this issue. This problem seems to be related to a similar issue. #770

Could you please provide guidance on how to address this? Thank you very much for your assistance.

tanghaibao · 2025-04-23T23:20:22Z

@ChuanzhengWei

Similar to the commonly used BLAST tool, sometimes the match is not transitive.

For example,

A is a fusion protein between B and C, and let's say 1, 2, 3 is the species.

Species 1: AAAAAAAAAAA
Species 2: BBBB
Species 3:      CCCCCC

You can then have matches between A1-B2, A1-C3, but B2-C3 are not homologs. There are other cases but in general, transitivity can be broken.

ChuanzhengWei · 2025-04-24T02:15:52Z

To investigate whether this issue exists, I used the following code to identify CDS sequences that are present in A vs B and A vs C gene pairs but not in the B vs C gene pairs. I then performed a BLAST search to check the matches, and I found that most of the sequences matched 100%, suggesting that they are not fusion proteins. However, despite these sequences being 100% identical, JCVI did not recognize them as gene pairs.

I have attached the code I used, as well as a portion of the results, for your review.

grep -v '#' A.B.anchors.new | cut -f 2 | sort -u > tmp1
grep -v '#' A.C.anchors.new | cut -f 2 | sort -u > tmp2
grep -v '#' B.C.anchors.new | cut -f 1 | sort -u > tmp3
grep -v '#' B.C.anchors.new | cut -f 2 | sort -u > tmp4
grep -Fxv -f tmp3 tmp1 > B_bed
grep -Fxv -f tmp4 tmp2 > C_bed
seqtk subseq B.cds B_bed > B.fasta
seqtk subseq C.cds C_bed > C.fasta
makeblastdb -in B.fasta -dbtype nucl -parse_seqids
blastn -query C.fasta -out blast.txt -db B.fasta -outfmt "6 qseqid sseqid pident length evalue qlen slen" -evalue 1e-5 -max_hsps 1 -max_target_seqs 1

g27	g27	100.000	1743	0.0	1743	1743
g28	g28	99.701	1671	0.0	1671	1668
g29	g29	100.000	2169	0.0	2169	2169
g30	g30	100.000	1032	0.0	1032	1032
g31	g31	99.926	1350	0.0	1350	1350
g32	g32	100.000	1137	0.0	1137	1137

I would appreciate any insights you can provide on why JCVI is not identifying these sequences as gene pairs and how I might resolve this issue.

tanghaibao · 2025-04-24T02:23:49Z

Take a look at the file *.last.filtered and see if the gene pairs are removed during the filtering.

There are two filters:

                    DEBUG    Load BLAST file `grape.peach.last` (total 515965 lines)         blastfilter.py:44
                    DEBUG    Load file `grape.peach.last`                                           base.py:34
[04/23/25 19:22:06] DEBUG    running the cscore filter (cscore>=0.70) ..                    blastfilter.py:106
                    DEBUG    after filter (379462->50587) ..                                blastfilter.py:108
                    DEBUG    running the local dups filter (tandem_Nmax=10) ..              blastfilter.py:113
[04/23/25 19:22:07] DEBUG    after filter (50587->21698) ..                                 blastfilter.py:153
                    DEBUG    Assuming --qbed=grape.bed --sbed=peach.bed                         synteny.py:389
                    DEBUG    Load file `grape.bed`                                                  base.py:34
                    DEBUG    Load file `peach.bed`                                                  base.py:34
                    DEBUG    Load file `grape.peach.last.filtered`                                  base.py:34
                    DEBUG    A total of 21698 BLAST imported from `grape.peach.last.filtered`.  synteny.py:276

In particular, the tandem filter may also be a reason for removal, which is what's happening in #770.

tanghaibao · 2025-04-24T02:26:16Z

Also, when you check the gene pairs, make sure to also check *.lifted.anchors as well to see if B#gene1 and C#gene1 are included in that file.

ChuanzhengWei · 2025-04-24T03:08:34Z

It seems that it was filtered out at this step. I did not find the gene pairs identified by the BLAST search in the *.lifted.anchors file.

[04/24/25 11:05:19] DEBUG    File `HDN.prj` found. Computation skipped.                                                                                                  base.py:1382
                    DEBUG    lastal -u 0 -i3G -f BlastTab -P 28 HDN HYZ.cds >./HYZ.HDN.last                                                                              base.py:1191
[04/24/25 11:05:30] DEBUG    Assuming --qbed=HYZ.bed --sbed=HDN.bed                                                                                                    synteny.py:392
                    DEBUG    Load file `HYZ.bed`                                                                                                                           base.py:36
                    DEBUG    Load file `HDN.bed`                                                                                                                           base.py:36
                    DEBUG    Load BLAST file `HYZ.HDN.last` (total 1213807 lines)                                                                                   blastfilter.py:45
                    DEBUG    Load file `HYZ.HDN.last`                                                                                                                      base.py:36
[04/24/25 11:05:32] DEBUG    running the cscore filter (cscore>=0.70) ..                                                                                           blastfilter.py:107
[04/24/25 11:05:33] DEBUG    after filter (1129125->20458) ..                                                                                                      blastfilter.py:109
                    DEBUG    running the local dups filter (tandem_Nmax=10) ..                                                                                     blastfilter.py:114
                    DEBUG    after filter (20458->18118) ..                                                                                                        blastfilter.py:154
                    DEBUG    Assuming --qbed=HYZ.bed --sbed=HDN.bed                                                                                                    synteny.py:392
                    DEBUG    Load file `HYZ.bed`                                                                                                                           base.py:36
                    DEBUG    Load file `HDN.bed`                                                                                                                           base.py:36
                    DEBUG    Load file `HYZ.HDN.last.filtered`                                                                                                             base.py:36
                    DEBUG    A total of 18118 BLAST imported from `HYZ.HDN.last.filtered`.                                                                             synteny.py:279
                    DEBUG    Chaining distance = 20                                                                                                                   synteny.py:1796

tanghaibao · 2025-04-24T04:40:24Z

To investigate whether this issue exists, I used the following code to identify CDS sequences that are present in A vs B and A vs C gene pairs but not in the B vs C gene pairs. I then performed a BLAST search to check the matches, and I found that most of the sequences matched 100%, suggesting that they are not fusion proteins. However, despite these sequences being 100% identical, JCVI did not recognize them as gene pairs.

I have attached the code I used, as well as a portion of the results, for your review.
grep -v '#' A.B.anchors.new | cut -f 2 | sort -u > tmp1
grep -v '#' A.C.anchors.new | cut -f 2 | sort -u > tmp2
grep -v '#' B.C.anchors.new | cut -f 1 | sort -u > tmp3
grep -v '#' B.C.anchors.new | cut -f 2 | sort -u > tmp4
grep -Fxv -f tmp3 tmp1 > B_bed
grep -Fxv -f tmp4 tmp2 > C_bed
seqtk subseq B.cds B_bed > B.fasta
seqtk subseq C.cds C_bed > C.fasta
makeblastdb -in B.fasta -dbtype nucl -parse_seqids
blastn -query C.fasta -out blast.txt -db B.fasta -outfmt "6 qseqid sseqid pident length evalue qlen slen" -evalue 1e-5 -max_hsps 1 -max_target_seqs 1
g27	g27	100.000	1743	0.0	1743	1743
g28	g28	99.701	1671	0.0	1671	1668
g29	g29	100.000	2169	0.0	2169	2169
g30	g30	100.000	1032	0.0	1032	1032
g31	g31	99.926	1350	0.0	1350	1350
g32	g32	100.000	1137	0.0	1137	1137
I would appreciate any insights you can provide on why JCVI is not identifying these sequences as gene pairs and how I might resolve this issue.

wait, are the gene names the same in different species? JCVI will remove gene pairs with the same name by default to prevent self matches.

ChuanzhengWei · 2025-04-24T05:02:02Z

no wonder! thank you for the clarification. I will make the necessary changes and try again. Thanks!

ChuanzhengWei · 2025-04-24T05:28:06Z

Indeed, the number of gene pairs has returned to normal. However, I have encountered a new issue. In the *.lifted.anchors file, there are some gene pairs that also exist in the .anchors files. Yet, after running the following command:
python -m jcvi.compara.synteny mcscan A.bed A.B.lifted.anchors --iter=1 -o A.B.i1.blocks
These gene pairs mysteriously disappear.(This issue occurs with the first approximately 2,000 genes.) Interestingly, this problem only occurs in one of the eight samples I am using, and I am unsure why this is happening.

file.tar.gz

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Issue with Gene Pair Discrepancy #773

Issue with Gene Pair Discrepancy #773

ChuanzhengWei commented Apr 22, 2025

tanghaibao commented Apr 23, 2025

ChuanzhengWei commented Apr 24, 2025 •

edited

Loading

tanghaibao commented Apr 24, 2025

tanghaibao commented Apr 24, 2025

ChuanzhengWei commented Apr 24, 2025

tanghaibao commented Apr 24, 2025

ChuanzhengWei commented Apr 24, 2025

ChuanzhengWei commented Apr 24, 2025

Issue with Gene Pair Discrepancy #773

Issue with Gene Pair Discrepancy #773

Comments

ChuanzhengWei commented Apr 22, 2025

tanghaibao commented Apr 23, 2025

ChuanzhengWei commented Apr 24, 2025 • edited Loading

tanghaibao commented Apr 24, 2025

tanghaibao commented Apr 24, 2025

ChuanzhengWei commented Apr 24, 2025

tanghaibao commented Apr 24, 2025

ChuanzhengWei commented Apr 24, 2025

ChuanzhengWei commented Apr 24, 2025

ChuanzhengWei commented Apr 24, 2025 •

edited

Loading