Masked superstrings – supplementary materials

Introduction
- Citation
Methods
- Superstring computation - KmerCamel🐫
- Mask optimization
Experimental evaluation
- Benchmark datasets
- Reproducing experimental results
Figures + supplementary plots
Remarks
Contact

Introduction

Here we provide supplementary materials for the paper Masked superstrings as a unified framework for textual k-mer set representations, including the used data and pipelines.

Citation

Ondřej Sladký, Pavel Veselý, and Karel Břinda: Masked superstrings as a unified framework for textual k-mer set representations. bioRxiv 2023.02.01.526717, 2023. https://doi.org/10.1101/2023.02.01.526717

@article{sladky2023-masked-superstrings,
  title   = { Masked superstrings as a unified framework for textual $k$-mer set representations },
  author  = { Sladk{\'y}, Ond{\v r}ej and Vesel{\'y}, Pavel and B{\v r}inda, Karel },
  journal = { bioRxiv },
  volume  = { 2023.02.01.526717 },
  year    = { 2023 },
  doi     = { 10.1101/2023.02.01.526717 }
}

Methods

Superstring computation - KmerCamel🐫

Superstrings were computed using the KmerCamel🐫 program, which experimentally implements local and global greedy heuristics for masked superstring computation using hash tables and the Aho-Corasick automaton.

By default, the superstrings computed by KmerCamel🐫 come with default masks; these contain the minimal possible number of 1's (i.e., every k-mer masked on only once) and the patterns of 1's and 0's reflect the orders in which individual k-mers were added to the superstrings. These default masks are denoted by D in the paper.

Importantly, changes in the underlying data structures (hash-table vs. AC automaton), as well as changing machines or compilers, results/may result in different superstrings and their mask, and the specific choices can affect mask compressibility. For instance, hash-table-based approaches tend to produce more regular masks that are better compressible (e.g., for nearly complete de Bruijn graphs).

Mask optimization

The masks were optimized using the scripts in the experiments/08_optimize_masks directory, which implement individual mask optimization strategies.

maskMinNumRuns.py. Minimization of the number of runs of 1's by integer programming using the PuLP solver, as described in Appendix H. (Denoted by R in the paper.)
maskMaxNumOnes.py. Maximization of the number of 1's in the mask, which is achieved by performing two passes through the superstring. In the first pass, the underlying k-mer K is computed, and in the second pass, all occurrences of k-mers from K are masked on. (Denoted by O in the paper.)
maskMaxNumZeros.py. Greedy minimization of the number of 1's (equivalent to maximization of the number of 0's), which is achieved by by performing one pass through the data, masking on the first occurrence of each k-mer while masking off all other occurrences. (Denoted by Z in the paper.)

Experimental evaluation

Benchmark datasets

We used four datasets for experimental evaluation (uploaded in the data/ directory):

S. pneumoniae genome (ATCC 700669, NC_011900.1, fna online)
- The resulting file: data/spneumoniae.fa.xz
S. pneumoniae pan-genome - 616 genomes, as provided in RASE DB S. pneumoniae
- k-mers were collected and stored in the form of simplitigs (ProphAsm v0.1.1, k=32, NS: 158,567, CL: 14,710,895 bp, #kmers: 9,795,318 32-mers)
- The resulting file: data/spneumo_pangenome_k32.fa.xz
S. cerevisiae genome (S288C, fna.gz online)
- data/yeast.fa.xz
SARS-CoV-2 pan-genome - downloaded from GISAID (access upon registration) on Jan 25, 2023 (GISAID version 2023_01_23, 14,682,066 genomes, 430 Gbp)
- k-mers were collected using JellyFish 2 (v2.2.10, 11,701,570 32-mers) and stored in the form of simplitigs (ProphAsm v0.1.1, k=32, NS: 345,866, CL: 22,423,416 bp, #kmers: 11,701,570 32-mers)
- The resulting file: data/sars-cov-2_pangenome_k32.fa.xz

The results in Figures 2 and 3 and Tables 1 and 2 were obtained using data in experiments/11_kmer_camel_comparison_v3/99_results/masked_superstrings_properties.kamenac.tsv.

Additionally, Tables 1 and 2 use data from experiments/12_tigs_stats/99_results/masked_superstrings_properties.kamenik.tsv.

Reproducing experimental results

After cloning this repository, run the following (besides standard Linux programs, it requires Snakemake and seqtk):

git submodule update --init # to download KmerCamel
cd experiments/11_kmer_camel_comparison_v3/
make test # a fast test that everything works
make
cd ../12_tigs_stats/
make

Figures + supplementary plots

Figures are created using Adobe Illustrator and combine individual subfigures generated using R by scripts provided in the respective directories.

Fig. 1 - Overview of masked superstrings (Sec. 2-3.6)

Fig. 1

Fig. 2 - Comparison of superstring heuristics (Sec. 3.7.1)

Directory
Fig. 2
Supplementary plots:
- Superstring characters per distinct k-mer - S.pneumoniae genome vs. pan-genome: Aho-Corasick, Hash tables
- Superstring compressibility (xz bits per distinct k-mer) - S.pneumoniae genome vs. pan-genome: Aho-Corasick, Hash tables

Fig. 3 - Comparison of mask heuristics (Sec. 3.7.2)

Directory
Fig. 3
Supplementary plots:
- Size of mask after compression (xz bits per distinct k-mer)
  - Hash table: S. pneumoniae genome, S. pneumoniae pan-genome, S. cerevisiae genome, SARS-CoV-2 pangenome
  - Aho-Corasick: S. pneumoniae genome, S. pneumoniae pan-genome, S. cerevisiae genome, SARS-CoV-2 pangenome
- Runs per superstring character
  - Hash table: S. pneumoniae genome, S. pneumoniae pan-genome, S. cerevisiae genome, SARS-CoV-2 pangenome
  - Aho-Corasick: S. pneumoniae genome, S. pneumoniae pan-genome, S. cerevisiae genome, SARS-CoV-2 pangenome

Remarks

11_kmer_camel_comparison_v3: Makefile specifies the number of cores for Snakemake, via argument -j. To test whether everything works, run make test first. To make it faster, limit the range of k in the Snakefile.
12_tigs_stats: uses compressed FASTA files computed by the matchtigs program

Contact

Karel Břinda ([email protected])
Pavel Veselý ([email protected])
Ondřej Sladký ([email protected])

Name		Name	Last commit message	Last commit date
Latest commit History 46 Commits
data		data
experiments		experiments
figures		figures
scripts		scripts
.gitignore		.gitignore
.gitmodules		.gitmodules
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Masked superstrings – supplementary materials

Introduction

Citation

Methods

Superstring computation - KmerCamel🐫

Mask optimization

Experimental evaluation

Benchmark datasets

Reproducing experimental results

Figures + supplementary plots

Fig. 1 - Overview of masked superstrings (Sec. 2-3.6)

Fig. 2 - Comparison of superstring heuristics (Sec. 3.7.1)

Fig. 3 - Comparison of mask heuristics (Sec. 3.7.2)

Remarks

Contact

About

Releases

Packages

Contributors 2

Languages

License

karel-brinda/masked-superstrings-supplement

Folders and files

Latest commit

History

Repository files navigation

Masked superstrings – supplementary materials

Introduction

Citation

Methods

Superstring computation - KmerCamel🐫

Mask optimization

Experimental evaluation

Benchmark datasets

Reproducing experimental results

Figures + supplementary plots

Fig. 1 - Overview of masked superstrings (Sec. 2-3.6)

Fig. 2 - Comparison of superstring heuristics (Sec. 3.7.1)

Fig. 3 - Comparison of mask heuristics (Sec. 3.7.2)

Remarks

Contact

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages