Questions About Building a Pan-TE Library for Multiple Genomes Using Earl Grey #194

Open
aaannaw opened this issue Apr 27, 2025 · 5 comments

Comments

@aaannaw

aaannaw commented Apr 27, 2025

Dear Dr. Baril,

Thank you for developing Earl Grey—it’s an incredibly useful tool for TE annotation! I’m writing to seek your advice on a few specific questions regarding the workflow for building a pan-TE library across multiple genomes within a family.

  1. Context: Pan-TE Library Construction
    We plan to:

Run earlGreyLibConstruct individually for each genome in our dataset (a taxonomic family) to generate de novo TE consensus sequences via the BEAT process.

Merge the resulting libraries into a single pan-TE library for downstream annotation.

  2. Questions on Library Merging and Redundancy Removal
    In your GitHub documentation, you mentioned manually curating merged libraries (e.g., using cat and cd-hit-est). We’d appreciate your guidance on:

Recommended parameters for cd-hit-est:

Should we use default parameters (e.g., -c 0.8 -n 5), or are there TE-specific adjustments (e.g., identity threshold, word size) you suggest?

Incorporating RepBase/Dfam:

Should we merge RepBase/Dfam consensus sequences directly into our pan-TE library before running cd-hit-est?

Or is it sufficient to only use the -r (RepeatMasker species) parameter during earlGreyAnnotationOnly, letting RepeatMasker handle database repeats separately?

  3. Role of LTRfinder in earlGreyAnnotationOnly
    Your GitHub flowchart shows LTRfinder being used during the annotation step (earlGreyAnnotationOnly) but not during library construction. Could you clarify:

Why is LTRfinder not used in earlGreyLibConstruct? Does the BEAT process already cover LTR detection?

How does LTRfinder augment the annotation when the consensus library is already built? (e.g., does it refine LTR boundaries or identify novel LTRs missed by the library?)

Looking forward to your reply!
Best wishes,
Na Wan

@TobyBaril
Owner

Hi @aaannaw,

In this case, the best approach is to generate libraries for each species, then to combine these into a single fasta file and reduce redundancy. You can find guidance for curation and library redundancy reduction here. The supplementary file contains a list of protocols that can help to refine libraries.

Unless there are already well-curated TE sequences for your species of interest, or closely related ones, the best approach is to stick to de novo detection for all your species, then to combine these and use a non-redundant set for all annotation (as this will mean family names are then consistent across all annotations etc). I would recommend adding an identifier to the TE library fasta headers before combining the files from each species so you know which species each consensus came from.
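
As an illustration of the header-tagging step, here is a minimal Python sketch. It is not part of Earl Grey: the function names `tag_headers` and `combine_libraries`, the `_` separator, and the file names in the usage note are all hypothetical. It assumes plain FASTA input and keeps everything after the tag unchanged, so a RepeatMasker-style `#class/family` suffix in the header is preserved.

```python
from pathlib import Path


def tag_headers(fasta_text: str, species: str) -> str:
    """Prefix every FASTA header with a species tag; sequence lines are untouched."""
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            # e.g. ">rnd-1_family-3#LINE/L1" becomes ">Fdm_rnd-1_family-3#LINE/L1"
            # (the "_" separator is an assumption, pick any character your tools accept)
            out.append(f">{species}_{line[1:]}")
        else:
            out.append(line)
    return "\n".join(out) + "\n"


def combine_libraries(libraries: dict[str, Path], combined: Path) -> int:
    """Write all tagged libraries into one FASTA file; return the number of sequences."""
    n_sequences = 0
    with combined.open("w") as handle:
        for species, path in libraries.items():
            tagged = tag_headers(path.read_text(), species)
            n_sequences += sum(line.startswith(">") for line in tagged.splitlines())
            handle.write(tagged)
    return n_sequences
```

A call such as `combine_libraries({"Fdm": Path("Fdm-families.fa.strained")}, Path("pan_library.fa"))` would then produce the combined file to pass on to redundancy reduction; the paths here are placeholders for wherever your per-species libraries live.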

I would not recommend using the -r option if you want to detect high-quality de novo repeats in the first instance. You can always BLAST your final library against Dfam to check whether any of your families match existing, well-curated ones. Using -r before de novo detection, however, leaves less information available to the de novo detection phases, which can reduce the quality of the resulting TE families and make some families look much older than they actually are.
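
For the BLAST check against Dfam, one way to summarise the results is sketched below. It assumes you have run BLAST yourself with the standard 12-column tabular output (`-outfmt 6`: qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore). The function name `best_dfam_hits` and the identity and length cut-offs are hypothetical illustrations, not values recommended in this thread.

```python
def best_dfam_hits(
    blast_tsv: str, min_identity: float = 80.0, min_length: int = 80
) -> dict[str, tuple[str, float]]:
    """Map each query family to its highest-bitscore subject that passes the filters."""
    best: dict[str, tuple[str, float, float]] = {}
    for line in blast_tsv.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        identity, length, bitscore = float(fields[2]), int(fields[3]), float(fields[11])
        # Cut-offs are assumptions for illustration only
        if identity < min_identity or length < min_length:
            continue
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, identity, bitscore)
    # Drop the bitscore from the returned summary
    return {q: (s, ident) for q, (s, ident, _) in best.items()}
```

Families that do not appear in the returned dictionary had no hit passing the filters and are candidates for being genuinely novel, although that still needs manual inspection.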

LTR_FINDER is employed in the annotation phase because it annotates individual TE copies rather than curating consensus sequences. It therefore does not make sense to include its output in the libraries earlier, and that is not its intended purpose. RepeatModeler2 will detect some LTR families and solo LTRs. Annotating your genomes post-curation with the AnnotationOnly pipeline allows these LTR annotations to be replaced with full-length LTR elements detected by LTR_FINDER; RepeatMasker does not always annotate these well because of their length and complexity.

I hope this helps, and good luck with the project!

@aaannaw
Author

aaannaw commented Apr 28, 2025

Dear Dr. Baril,

Thank you for your detailed advice on TE annotation. I’m currently working on the first step—building de novo TE libraries for each target species using earlGreyLibConstruct. However, I’d like to clarify two points based on your recommendations:

Regarding the -r option:

Your note mentioned "I would not recommend using the -r option if you want to detect high-quality de novo repeats in the first instance."

My current command includes -r Rodentia (to reference Dfam/RepBase for Rodentia):

singularity exec [...] earlGreyLibConstruct -g Fdm.fasta -s Fdm -o ./output -t 30 -r Rodentia > log
Question: Does your suggestion to avoid -r apply specifically to the library construction step (earlGreyLibConstruct), or only later during the earlGreyAnnotationOnly phase? In other words, should I remove -r Rodentia in the above command?

Database combining:

You recommended combining de novo libraries across species but did not mention combining them with Dfam/RepBase.

Question: Since my target species and close relatives lack well-curated TE libraries, should I completely omit Dfam/RepBase (-r) throughout the entire workflow (both library construction and annotation)? Or is there a later stage where referencing these databases becomes useful?

Thank you for your time!
Best regards,

@TobyBaril
Owner

Hi @aaannaw,

In this case, if there are no existing well-curated sequences for your species and/or closely related ones, you would be better off avoiding -r throughout the whole workflow. This will provide more information for the de novo detection step, resulting in higher-quality TE consensus sequences and more accurate TE age estimates. RepeatClassifier will still use your Dfam libraries to aid in the classification of de novo families (without needing to specify this in your earlGrey command), which should help to reduce the number of unclassified families in your final libraries.

Therefore, for your case I would proceed with a classic de novo earl grey run for each input, then combine the .strained files for each species, and reduce redundancy following the protocol in the link from my previous comment.
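
For the redundancy-reduction step, the linked protocol is the reference to follow. Purely as an illustration of what a cd-hit-est call on the combined library can look like, the sketch below builds (but does not run) a command line. The flags are real cd-hit-est options, but the values (80% identity over 80% of the shorter sequence) are an assumption based on a commonly used starting point, not parameters given in this thread, and the function name and file names are hypothetical.

```python
import shlex


def cd_hit_est_command(combined_fasta: str, output_fasta: str, threads: int = 8) -> list[str]:
    """Build (but do not run) a cd-hit-est call for clustering a merged TE library."""
    return [
        "cd-hit-est",
        "-i", combined_fasta,  # merged, species-tagged library
        "-o", output_fasta,    # one representative sequence per cluster
        "-c", "0.8",           # sequence identity threshold (assumed value)
        "-n", "5",             # word size; 5 is valid for thresholds of 0.80 to 0.85
        "-aS", "0.8",          # alignment must cover 80% of the shorter sequence (assumed value)
        "-G", "0",             # use local rather than global sequence identity
        "-g", "1",             # assign each sequence to its most similar cluster
        "-d", "0",             # keep names up to the first space in the .clstr file
        "-T", str(threads),    # number of threads
        "-M", "0",             # no memory limit
    ]


# Print the command so it can be inspected before running it by hand
print(shlex.join(cd_hit_est_command("pan_library.fa", "pan_library.nr.fa", threads=30)))
```

Building the command as a list makes it easy to hand to `subprocess.run` once the parameters have been checked against the protocol.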

I hope this helps!

Toby

@aaannaw
Author

aaannaw commented Apr 28, 2025

Dear Dr. Baril,

Thank you for your prompt and helpful response! I’ll proceed with pure de novo runs (omitting -r throughout) and follow your guidance on library merging and redundancy reduction.

I’d like to clarify one additional point regarding RepeatModeler integration to optimize runtime:

(1) Reusing existing RepeatModeler outputs:

As you suggested in Issue #131, I’ve already placed Cho-families.fa in ${OUTDIR}/${species}_Database/.

Question: Are there other RepeatModeler output files (e.g., RM_/consensi.fa.classified, round-/families.stk) that need to be copied to ${species}_Database/ or ${species}_RepeatModeler/ for earlGreyLibConstruct to fully recognize and skip redundant RepeatModeler runs?

(2) -LTRStruct parameter compatibility:

My prior RepeatModeler runs included -LTRStruct (to improve LTR detection).

Question: Is this parameter compatible with earlGreyLibConstruct’s internal RepeatModeler calls? Or should I rerun RepeatModeler without -LTRStruct to ensure consistency with EarlGrey’s expected inputs?

I want to ensure seamless integration while avoiding unintended conflicts. Thank you again for your support!

Best regards,
Na Wan

@TobyBaril
Owner

The only file you need in Database is the -families file for Earl Grey to work properly.

If your local version of RepeatModeler includes -LTRStruct it should work in theory, but I'm not sure how this will behave with the TEstrainer curation process (could always be worth a test to see if it works!).
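
As a small sketch of staging that file, the helper below copies an existing RepeatModeler families file into the location described earlier in this thread (`${OUTDIR}/${species}_Database/`, named `<species>-families.fa` as in the `Cho-families.fa` example). The function name `stage_families` is hypothetical, and the directory layout is taken from the thread rather than checked against the Earl Grey source, so verify it against your own run before relying on it.

```python
import shutil
from pathlib import Path


def stage_families(families_fa: Path, outdir: Path, species: str) -> Path:
    """Copy a RepeatModeler families file to <outdir>/<species>_Database/<species>-families.fa."""
    # Layout assumed from the thread: ${OUTDIR}/${species}_Database/
    database_dir = outdir / f"{species}_Database"
    database_dir.mkdir(parents=True, exist_ok=True)
    target = database_dir / f"{species}-families.fa"
    shutil.copyfile(families_fa, target)
    return target
```

For example, `stage_families(Path("previous_run/Cho-families.fa"), Path("./output"), "Cho")` would place the file at `output/Cho_Database/Cho-families.fa`; the source path here is a placeholder.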
