Skip to content

karel-brinda/masked-superstrings-supplement

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

46 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Masked superstrings – supplementary materials

Introduction

Here we provide supplementary materials for the paper Masked superstrings as a unified framework for textual k-mer set representations, including the used data and pipelines.

Citation

Ondřej Sladký, Pavel Veselý, and Karel Břinda: Masked superstrings as a unified framework for textual k-mer set representations. bioRxiv 2023.02.01.526717, 2023. https://doi.org/10.1101/2023.02.01.526717

@article{sladky2023-masked-superstrings,
  title   = { Masked superstrings as a unified framework for textual $k$-mer set representations },
  author  = { Sladk{\'y}, Ond{\v r}ej and Vesel{\'y}, Pavel and B{\v r}inda, Karel },
  journal = { bioRxiv },
  volume  = { 2023.02.01.526717 },
  year    = { 2023 },
  doi     = { 10.1101/2023.02.01.526717 }
}

Methods

Superstring computation - KmerCamel🐫

Superstrings were computed using the KmerCamel🐫 program, which experimentally implements local and global greedy heuristics for masked superstring computation using hash tables and the Aho-Corasick automaton.

By default, the superstrings computed by KmerCamel🐫 come with default masks; these contain the minimal possible number of 1's (i.e., every k-mer masked on only once) and the patterns of 1's and 0's reflect the orders in which individual k-mers were added to the superstrings. These default masks are denoted by D in the paper.

Importantly, changes in the underlying data structures (hash-table vs. AC automaton), as well as changing machines or compilers, results/may result in different superstrings and their mask, and the specific choices can affect mask compressibility. For instance, hash-table-based approaches tend to produce more regular masks that are better compressible (e.g., for nearly complete de Bruijn graphs).

Mask optimization

The masks were optimized using the scripts in the experiments/08_optimize_masks directory, which implement individual mask optimization strategies.

  • maskMinNumRuns.py. Minimization of the number of runs of 1's by integer programming using the PuLP solver, as described in Appendix H. (Denoted by R in the paper.)
  • maskMaxNumOnes.py. Maximization of the number of 1's in the mask, which is achieved by performing two passes through the superstring. In the first pass, the underlying k-mer K is computed, and in the second pass, all occurrences of k-mers from K are masked on. (Denoted by O in the paper.)
  • maskMaxNumZeros.py. Greedy minimization of the number of 1's (equivalent to maximization of the number of 0's), which is achieved by by performing one pass through the data, masking on the first occurrence of each k-mer while masking off all other occurrences. (Denoted by Z in the paper.)

Experimental evaluation

Benchmark datasets

We used four datasets for experimental evaluation (uploaded in the data/ directory):

  • S. pneumoniae genome (ATCC 700669, NC_011900.1, fna online)
  • S. pneumoniae pan-genome - 616 genomes, as provided in RASE DB S. pneumoniae
    • k-mers were collected and stored in the form of simplitigs (ProphAsm v0.1.1, k=32, NS: 158,567, CL: 14,710,895 bp, #kmers: 9,795,318 32-mers)
    • The resulting file: data/spneumo_pangenome_k32.fa.xz
  • S. cerevisiae genome (S288C, fna.gz online)
  • SARS-CoV-2 pan-genome - downloaded from GISAID (access upon registration) on Jan 25, 2023 (GISAID version 2023_01_23, 14,682,066 genomes, 430 Gbp)
    • k-mers were collected using JellyFish 2 (v2.2.10, 11,701,570 32-mers) and stored in the form of simplitigs (ProphAsm v0.1.1, k=32, NS: 345,866, CL: 22,423,416 bp, #kmers: 11,701,570 32-mers)
    • The resulting file: data/sars-cov-2_pangenome_k32.fa.xz

The results in Figures 2 and 3 and Tables 1 and 2 were obtained using data in experiments/11_kmer_camel_comparison_v3/99_results/masked_superstrings_properties.kamenac.tsv.

Additionally, Tables 1 and 2 use data from experiments/12_tigs_stats/99_results/masked_superstrings_properties.kamenik.tsv.

Reproducing experimental results

After cloning this repository, run the following (besides standard Linux programs, it requires Snakemake and seqtk):

git submodule update --init # to download KmerCamel
cd experiments/11_kmer_camel_comparison_v3/
make test # a fast test that everything works
make
cd ../12_tigs_stats/
make

Figures + supplementary plots

Figures are created using Adobe Illustrator and combine individual subfigures generated using R by scripts provided in the respective directories.

Fig. 1 - Overview of masked superstrings (Sec. 2-3.6)

Fig. 2 - Comparison of superstring heuristics (Sec. 3.7.1)

Fig. 3 - Comparison of mask heuristics (Sec. 3.7.2)

Remarks

  • 11_kmer_camel_comparison_v3: Makefile specifies the number of cores for Snakemake, via argument -j. To test whether everything works, run make test first. To make it faster, limit the range of k in the Snakefile.
  • 12_tigs_stats: uses compressed FASTA files computed by the matchtigs program

Contact

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published