Skip to content

Issue with Gene Pair Discrepancy #773

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
ChuanzhengWei opened this issue Apr 22, 2025 · 8 comments
Open

Issue with Gene Pair Discrepancy #773

ChuanzhengWei opened this issue Apr 22, 2025 · 8 comments

Comments

@ChuanzhengWei
Copy link

I am using JCVI, which is a very powerful tool, but I have encountered some issues. I am analyzing four varieties (A, B, C, D) of the same species using the same code for gene pair analysis. However, the results show significant differences in the number of gene pairs, as follows:

  • A vs B: 26,733 gene pairs
  • A vs C: 26,644 gene pairs
  • A vs D: 26,663 gene pairs
  • B vs C: 15,791 gene pairs
  • B vs D: 15,721 gene pairs

For example, A#gene1 and B#gene1 are identified as a gene pair, A#gene1 and C#gene1 are also identified as a gene pair, but B#gene1 and C#gene1 are not identified as a gene pair.

I would like to understand why this discrepancy is occurring and how to resolve this issue. This problem seems to be related to a similar issue. #770

Could you please provide guidance on how to address this? Thank you very much for your assistance.

@tanghaibao
Copy link
Owner

@ChuanzhengWei

Similar to the commonly used BLAST tool, sometimes the match is not transitive.

For example,

A is a fusion protein between B and C, and let's say 1, 2, 3 is the species.

Species 1: AAAAAAAAAAA
Species 2: BBBB
Species 3:      CCCCCC

You can then have matches between A1-B2, A1-C3, but B2-C3 are not homologs. There are other cases but in general, transitivity can be broken.

@ChuanzhengWei
Copy link
Author

ChuanzhengWei commented Apr 24, 2025

To investigate whether this issue exists, I used the following code to identify CDS sequences that are present in A vs B and A vs C gene pairs but not in the B vs C gene pairs. I then performed a BLAST search to check the matches, and I found that most of the sequences matched 100%, suggesting that they are not fusion proteins. However, despite these sequences being 100% identical, JCVI did not recognize them as gene pairs.

I have attached the code I used, as well as a portion of the results, for your review.

grep -v '#' A.B.anchors.new | cut -f 2 | sort -u > tmp1
grep -v '#' A.C.anchors.new | cut -f 2 | sort -u > tmp2
grep -v '#' B.C.anchors.new | cut -f 1 | sort -u > tmp3
grep -v '#' B.C.anchors.new | cut -f 2 | sort -u > tmp4
grep -Fxv -f tmp3 tmp1 > B_bed
grep -Fxv -f tmp4 tmp2 > C_bed
seqtk subseq B.cds B_bed > B.fasta
seqtk subseq C.cds C_bed > C.fasta
makeblastdb -in B.fasta -dbtype nucl -parse_seqids
blastn -query C.fasta -out blast.txt -db B.fasta -outfmt "6 qseqid sseqid pident length evalue qlen slen" -evalue 1e-5 -max_hsps 1 -max_target_seqs 1
g27	g27	100.000	1743	0.0	1743	1743
g28	g28	99.701	1671	0.0	1671	1668
g29	g29	100.000	2169	0.0	2169	2169
g30	g30	100.000	1032	0.0	1032	1032
g31	g31	99.926	1350	0.0	1350	1350
g32	g32	100.000	1137	0.0	1137	1137

I would appreciate any insights you can provide on why JCVI is not identifying these sequences as gene pairs and how I might resolve this issue.

@tanghaibao
Copy link
Owner

Take a look at the file *.last.filtered and see if the gene pairs are removed during the filtering.

There are two filters:

                    DEBUG    Load BLAST file `grape.peach.last` (total 515965 lines)         blastfilter.py:44
                    DEBUG    Load file `grape.peach.last`                                           base.py:34
[04/23/25 19:22:06] DEBUG    running the cscore filter (cscore>=0.70) ..                    blastfilter.py:106
                    DEBUG    after filter (379462->50587) ..                                blastfilter.py:108
                    DEBUG    running the local dups filter (tandem_Nmax=10) ..              blastfilter.py:113
[04/23/25 19:22:07] DEBUG    after filter (50587->21698) ..                                 blastfilter.py:153
                    DEBUG    Assuming --qbed=grape.bed --sbed=peach.bed                         synteny.py:389
                    DEBUG    Load file `grape.bed`                                                  base.py:34
                    DEBUG    Load file `peach.bed`                                                  base.py:34
                    DEBUG    Load file `grape.peach.last.filtered`                                  base.py:34
                    DEBUG    A total of 21698 BLAST imported from `grape.peach.last.filtered`.  synteny.py:276

In particular, the tandem filter may also be a reason for removal, which is what's happening in #770.

@tanghaibao
Copy link
Owner

Also, when you check the gene pairs, make sure to also check *.lifted.anchors as well to see if B#gene1 and C#gene1 are included in that file.

@ChuanzhengWei
Copy link
Author

It seems that it was filtered out at this step. I did not find the gene pairs identified by the BLAST search in the *.lifted.anchors file.

[04/24/25 11:05:19] DEBUG    File `HDN.prj` found. Computation skipped.                                                                                                  base.py:1382
                    DEBUG    lastal -u 0 -i3G -f BlastTab -P 28 HDN HYZ.cds >./HYZ.HDN.last                                                                              base.py:1191
[04/24/25 11:05:30] DEBUG    Assuming --qbed=HYZ.bed --sbed=HDN.bed                                                                                                    synteny.py:392
                    DEBUG    Load file `HYZ.bed`                                                                                                                           base.py:36
                    DEBUG    Load file `HDN.bed`                                                                                                                           base.py:36
                    DEBUG    Load BLAST file `HYZ.HDN.last` (total 1213807 lines)                                                                                   blastfilter.py:45
                    DEBUG    Load file `HYZ.HDN.last`                                                                                                                      base.py:36
[04/24/25 11:05:32] DEBUG    running the cscore filter (cscore>=0.70) ..                                                                                           blastfilter.py:107
[04/24/25 11:05:33] DEBUG    after filter (1129125->20458) ..                                                                                                      blastfilter.py:109
                    DEBUG    running the local dups filter (tandem_Nmax=10) ..                                                                                     blastfilter.py:114
                    DEBUG    after filter (20458->18118) ..                                                                                                        blastfilter.py:154
                    DEBUG    Assuming --qbed=HYZ.bed --sbed=HDN.bed                                                                                                    synteny.py:392
                    DEBUG    Load file `HYZ.bed`                                                                                                                           base.py:36
                    DEBUG    Load file `HDN.bed`                                                                                                                           base.py:36
                    DEBUG    Load file `HYZ.HDN.last.filtered`                                                                                                             base.py:36
                    DEBUG    A total of 18118 BLAST imported from `HYZ.HDN.last.filtered`.                                                                             synteny.py:279
                    DEBUG    Chaining distance = 20                                                                                                                   synteny.py:1796

@tanghaibao
Copy link
Owner

To investigate whether this issue exists, I used the following code to identify CDS sequences that are present in A vs B and A vs C gene pairs but not in the B vs C gene pairs. I then performed a BLAST search to check the matches, and I found that most of the sequences matched 100%, suggesting that they are not fusion proteins. However, despite these sequences being 100% identical, JCVI did not recognize them as gene pairs.

I have attached the code I used, as well as a portion of the results, for your review.

grep -v '#' A.B.anchors.new | cut -f 2 | sort -u > tmp1
grep -v '#' A.C.anchors.new | cut -f 2 | sort -u > tmp2
grep -v '#' B.C.anchors.new | cut -f 1 | sort -u > tmp3
grep -v '#' B.C.anchors.new | cut -f 2 | sort -u > tmp4
grep -Fxv -f tmp3 tmp1 > B_bed
grep -Fxv -f tmp4 tmp2 > C_bed
seqtk subseq B.cds B_bed > B.fasta
seqtk subseq C.cds C_bed > C.fasta
makeblastdb -in B.fasta -dbtype nucl -parse_seqids
blastn -query C.fasta -out blast.txt -db B.fasta -outfmt "6 qseqid sseqid pident length evalue qlen slen" -evalue 1e-5 -max_hsps 1 -max_target_seqs 1
g27	g27	100.000	1743	0.0	1743	1743
g28	g28	99.701	1671	0.0	1671	1668
g29	g29	100.000	2169	0.0	2169	2169
g30	g30	100.000	1032	0.0	1032	1032
g31	g31	99.926	1350	0.0	1350	1350
g32	g32	100.000	1137	0.0	1137	1137

I would appreciate any insights you can provide on why JCVI is not identifying these sequences as gene pairs and how I might resolve this issue.

wait, are the gene names the same in different species? JCVI will remove gene pairs with the same name by default to prevent self matches.

@ChuanzhengWei
Copy link
Author

no wonder! thank you for the clarification. I will make the necessary changes and try again. Thanks!

@ChuanzhengWei
Copy link
Author

Indeed, the number of gene pairs has returned to normal. However, I have encountered a new issue. In the *.lifted.anchors file, there are some gene pairs that also exist in the .anchors files. Yet, after running the following command:
python -m jcvi.compara.synteny mcscan A.bed A.B.lifted.anchors --iter=1 -o A.B.i1.blocks
These gene pairs mysteriously disappear.(This issue occurs with the first approximately 2,000 genes.) Interestingly, this problem only occurs in one of the eight samples I am using, and I am unsure why this is happening.

file.tar.gz

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants