Skip to content

Commit 9269474

Browse files
committed
Add Mm and Ml methylation tags to SAMtags.tex. PR#418
These are currently in draft form, as Mm and Ml, to be migrated to MM and ML in a month or so.
1 parent fd666f7 commit 9269474

12 files changed

+632
-0
lines changed

Makefile

+10
Original file line numberDiff line numberDiff line change
@@ -66,4 +66,14 @@ clean: mostlyclean
6666
-rm -rf .jekyll-cache .jekyll-metadata _site
6767

6868

69+
# Checking of MM tag perl script
70+
check test:
71+
cd test/SAMtags; \
72+
for i in `echo *.sam | sed 's/\.sam//g'`; \
73+
do \
74+
./parse_mm.pl $$i.sam > _; \
75+
cmp _ $$i.txt; \
76+
done; \
77+
rm _
78+
6979
.PHONY: all pdf diff diffs show-styles mostlyclean clean

SAMtags.tex

+166
Original file line numberDiff line numberDiff line change
@@ -471,6 +471,169 @@ \subsubsection{Color space}
471471
Color read quality on the original strand of the read. Same encoding as {\sf QUAL}; same length as {\tt CS}.
472472
\end{description}
473473

474+
\section{Draft tags}
475+
476+
These are tags which have been proposed and are broadly accepted to
477+
become standard tags, but a review or probationary period has been
478+
deemed useful. They use the locally-defined tag namespace and
479+
processing software should consider that the tags may have local usage
480+
for other purposes.
481+
482+
\begin{center}\small
483+
% This table is sorted alphabetically
484+
\begin{longtable}{ccp{12.5cm}}
485+
\hline
486+
{\bf Tag} & {\bf Type} & {\bf Description} \\
487+
\hline
488+
\endhead
489+
{\tt Mm} & Z & Base modifications / methylation \\
490+
% {\tt MP} & Z & Base modification qualities \\
491+
{\tt Ml} & B,C & Base modification probabilities \\
492+
\end{longtable}
493+
\end{center}
494+
495+
\subsection{Base modifications}
496+
497+
Base modifications, including base methylation, are represented as a series of edits from the primary unmodified sequence as originally reported by the sequencing instrument.
498+
This potentially differs to the sequence stored in the main SAM {\sf SEQ} field if the latter has been reverse complemented, in which case SAM {\sf FLAG} 0x10 must be set.
499+
This means modification positions are also recorded against the original orientation (i.e. starting at the 5' end), and count the original base types.
500+
501+
Each modified base listed also has a quality value associated with it.
502+
Given the unmodified base already has a phred likelihood, this base modification quality should be interpreted as the likelihood of this modification being correct given an assumption the original call is correct.
503+
504+
\begin{description}
505+
\item[Mm:Z:\tagregex{([ACGTUN][-+]([a-z]+|[0-9]+)(,[0-9]+)*;)*}]
506+
\hfill\\
507+
The first character is the unmodified ``fundamental'' base as reported
508+
by the sequencing instrument for the top strand.
509+
It must be one of {\tt A}, {\tt C}, {\tt G}, {\tt T}, {\tt U} (if RNA) or {\tt N} for anything else, including any IUPAC ambiguity codes in the reported SEQ field.
510+
Note {\tt N} may be used to match any base rather than specifically an {\tt N} call by the sequencing instrument.
511+
This may be used in situations where the base modification is not a derivation of a standard base type.
512+
This is followed by either plus or minus indicating the strand the modification was observed on (relative to the original sequenced strand of {\sf SEQ} with plus meaning same orientation),\footnote{Hence a tool that may reverse complement sequences does not need to understand how to manipulate the {\tt Mm} and {\tt Ml} tags.} and one or more base modification codes.
513+
This is then followed by a comma separated list of how many unmodified seq bases of the stated base type to skip, stored as a delta to the last and starting with 0 as the first (or next) base, starting from the uncomplemented 5' end of the {\sf SEQ} field.
514+
This number series is comparable to the numbers in an {\tt MD} tag,
515+
albeit counting specific base types only and potentially reverse-complemented.
516+
517+
For example {\tt C+m,5,12,0;} tells us there are three
518+
5-Methylcytosine bases on the top strand of {\sf SEQ}.
519+
The first 5 {\tt C} bases are unmodified and the 6th is modified, as are the 19th (with 12 between the 6th and 19th) and 20th.
520+
Similarly {\tt G-m,14;} indicates the 15th {\tt G} is a 5-Methylcytosine on the opposite strand (still counting using the top strand base calls from the 5' end).
521+
When the alignment record is reverse complemented (SAM flag 0x10) these two examples do not change since the tag always refers to the as-sequenced orientation.
522+
See the test/SAMtags/MM-orient.sam file for examples.
523+
524+
This permits modifications to be listed on either strand with the rare potential for both strands to have a modification at the same site.
525+
If SAM FLAG 0x10 is set, indicating that SEQ has been reverse complemented from the sequence observed by the sequencing machine, note that these base modification field values will be in the opposite orientation to SEQ and other derived SAM fields.
526+
527+
Note it is permitted for the coordinate list to be empty (for example {\tt Mm:Z:C+m;}), which may be used as an explicit indicator that this base modification is not present.
528+
It is not permitted for coordinates to be beyond the length of the sequence.
529+
530+
When multiple modifications are listed, for example {\tt C+mh,5,12,0;}, it indicates the modification may be any of the stated bases.
531+
The associated confidence values in the {\tt Ml} tag may be used to determine the relative likelihoods between the options.
532+
The example above is equivalent to {\tt C+m,5,12,0;C+h,5,12,0;}, although this will have a different ordering of confidence values in {\tt Ml}.
533+
Note ChEBI codes cannot be used in the multi-modification form (such as the {\tt C+mh} example above).
534+
535+
If the modification is not one of the standard common types (listed below) it can be specified as a numeric ChEBI code.
536+
For example {\tt C+76792,57;} is the same as {\tt C+h,57;}.
537+
538+
An unmodified base of {\tt N} means count any base in {\sf SEQ}, not only those of {\tt N}.
539+
Thus {\tt N+n,100;} means the 101st base is Xanthosine (n), irrespective of the sequence composition.
540+
541+
The standard code types and their associated ChEBI values are listed
542+
below, taken from Viner {\it et al.}%
543+
\footnote{Coby Viner {\it et al.}, \emph{Modeling methyl-sensitive
544+
transcription factor motifs with an expanded epigenetic alphabet}, \url{https://www.biorxiv.org/content/10.1101/043794v1}.}
545+
Additionally ambiguity codes {\tt A}, {\tt C}, {\tt G}, {\tt T} and {\tt U}
546+
exist to represent unspecified modifications bases of their respective
547+
canonical base types, plus code {\tt N} to represent an unspecified
548+
modification of any base type.
549+
550+
\begin{center}
551+
\begin{tabular}{lllll}
552+
{\bf Unmodified base} & {\bf Code} & {\bf Abbreviation} & {\bf Name} & {\bf ChEBI} \\
553+
\hline
554+
C & m & 5mC & 5-Methylcytosine & 27551 \\
555+
C & h & 5hmC & 5-Hydroxymethylcytosine & 76792 \\
556+
C & f & 5fC & 5-Formylcytosine & 76794 \\
557+
C & c & 5caC & 5-Carboxylcytosine & 76793 \\
558+
C & C & & Ambiguity code; any C mod & \\
559+
\hline
560+
T & g & 5hmU & 5-Hydroxymethyluracil & 16964 \\
561+
T & e & 5fU & 5-Formyluracil & 80961 \\
562+
T & b & 5caU & 5-Carboxyluracil & 17477 \\
563+
T & T & & Ambiguity code; any T mod & \\
564+
\hline
565+
U & U & & Ambiguity code; any U mod & \\
566+
\hline
567+
A & a & 6mA & 6-Methyladenine & 28871 \\
568+
A & A & & Ambiguity code; any A mod & \\
569+
\hline
570+
G & o & 8oxoG & 8-Oxoguanine & 44605 \\
571+
G & G & & Ambiguity code; any G mod & \\
572+
\hline
573+
N & n & Xao & Xanthosine & 18107 \\
574+
N & N & & Ambiguity code; any mod & \\
575+
\end{tabular}
576+
\end{center}
577+
578+
% MP was the former quality score for MM. However being Phred scores
579+
% it can only reasonable record probabilities for highly likely
580+
% events, making it inappropriate for callers (eg ONT's) that wish to
581+
% jointly call probabilities for the entire trained set of
582+
% possibilities. We could use log-odds, similar to how early Illumina
583+
% runs did to record likelihoods for A, C, G and T irrespective of
584+
% call, but for now we're using linear-scaled probabilities. These
585+
% are in the ML tag.
586+
%
587+
% The MP tag is left here for now as the jury is still out on whether
588+
% we'll need it in the future.
589+
%
590+
% \item[MP:Z:\tagvalue{qualities}]
591+
% \hfill\\
592+
% The optional {\tt MP} tag lists the Phred qualities of each modification listed in the {\tt MM} tag in the order they occur.
593+
% The qualities are encoded in the same manner as the primary {\sf QUAL} field; one byte per quality with ASCII value Phred score + 33.
594+
% A space character (`{\tt \textvisiblespace}') should be used as a separator between concatenated quality strings when multiple modification lists are present in the {\tt MM} tag.
595+
% The length should match the number of position deltas from {\tt MM} plus 1 per space character required.
596+
%
597+
% For example ``{\tt MM:Z:C+m,5,12,3;C+h,57;}'' may have an associated
598+
% quality tag of ``{\tt MP:Z:5EB /}''.
599+
%
600+
% Where multiple modification types are listed together, such as in ``{\tt MM:Z:C+mh,5,12,3;}'' the quality values are interleaved in order ({\tt m} at 6, {\tt h} at 6, {\tt m} at 19, {\tt h} at 19 and so on), giving 6 quality values in total for this example.
601+
%
602+
% Quality values for ambiguity codes give the likelihood that the
603+
% modification is one of the possible codes compatible with that
604+
% ambiguity code. For example {\tt MM:Z:C+C,10; MP:Z:+} indicates a C
605+
% call with an unspecified modification and the phred score of 10 (ASCII
606+
% value {\tt +}). This corresponds to a 90\% chance of the base being
607+
% modified.
608+
%
609+
% To represent several possible modifications at the same site the {\tt MP} tag can be used to indicate the probabilities of each possibility.
610+
% The values used should be absolute probabilities, not relative between the alternatives.
611+
% For example, a C base that has 95\% chance of being modified with 5mC being three times more likely than 5hmC will encode 5mC with 67.5\% probability ($0.9 * 0.75$ giving phred score 5, ASCII value {\tt \&})and 5hmC with 22.5\% probability ($0.9 * 0.25$ giving phred score 1, ASCII value {\tt "}).
612+
% This could be represented with ``{\tt MM:Z:C+m,10;C+h,10; MP:Z:" \&}''.
613+
614+
\item[Ml:B:C,\tagvalue{scaled-probabilities}]
615+
\hfill\\
616+
The optional {\tt Ml} tag lists the probability of each modification listed in the {\tt Mm} tag being correct, in the order that they occur.
617+
The continuous probability range 0.0 to 1.0 is remapped in equal
618+
sized portions to the discrete integers 0 to 255 inclusively. Thus the
619+
probability range corresponding to integer value $N$ is $N/256$ to
620+
$(N+1)/256$.
621+
622+
The SAM encoding therefore uses a byte array of type `{\tt C}' with the number of elements matching the summation of the number of modifications listed as being present in the {\tt Mm} tag accounting for multi-modifications each having their own probability.
623+
624+
For example ``{\tt Mm:Z:C+m,5,12;C+h,5,12;}'' may have an associated tag of ``{\tt Ml:B:C,204,89,26,130}''.
625+
626+
If the above is rewritten in the multiple-modification form, the probabilities are interleaved in the order presented, giving ``{\tt Mm:Z:C+mh,5,12; Ml:B:C,204,26,89,130}''.
627+
Note where several possible modifications are presented at the same site, the {\tt Ml} values represent the absolute probabilities of the modification call being correct and not the relative likelihood between the alternatives.
628+
These probabilities should not sum to above 1.0 ($\approx 256$ in integer encoding, allowing for some minor rounding errors), but may sum to a lower total with the remainder representing the probability that none of the listed modification types are present.
629+
In the example used above, the 6th {\tt C} has 80\% chance of being {\tt 5mC}, 10\% chance of being {\tt 5hmC} and 10\% chance of being an unmodified {\tt C}.
630+
631+
{\tt Ml} values for ambiguity codes give the probability that the modification is one of the possible codes compatible with that ambiguity code.
632+
For example {\tt Mm:Z:C+C,10; Ml:B:C,229} indicates a C call with a probability of 90\% of having some form of unspecified modification.
633+
634+
\end{description}
635+
636+
474637
\section{Locally-defined tags}
475638

476639
You can freely add new tags.
@@ -492,6 +655,9 @@ \section{Tag History}
492655
\setlength{\parindent}{0pt}
493656
\newcommand*{\gap}{\vspace*{2ex}}
494657

658+
\subsubsection*{July 2021}
659+
Added the Mm and Ml draft tags describing base modifications.
660+
495661
\subsubsection*{March 2020}
496662

497663
Transcript strand tag TS added, equivalent to the locally-defined XS tag

test/SAMtags/MM-chebi.sam

+2
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,2 @@
1+
@CO Separate m, h and N modifications
2+
* 0 * 0 0 * * 0 0 AGCTCTCCAGAGTCGNACGCCATYCGCGCGCCACCA * Mm:Z:C+m,2,2,1,4,1;C+76792,6,7;N+n,15; Ml:B:C,102,128,153,179,204,161,187,212,169

test/SAMtags/MM-chebi.txt

+36
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,36 @@
1+
A T
2+
G C
3+
C G
4+
T A
5+
C G
6+
T A
7+
Cm40 G
8+
C G
9+
A T
10+
G C
11+
A T
12+
G C
13+
T A
14+
C G
15+
G C
16+
Nn83 N
17+
A T
18+
Cm50 G
19+
G C
20+
C(76792)63 G
21+
Cm59 G
22+
A T
23+
T A
24+
Y R
25+
C G
26+
G C
27+
C G
28+
G C
29+
C G
30+
G C
31+
C G
32+
Cm70 G
33+
A T
34+
C G
35+
Cm79(76792)73 G
36+
A T

test/SAMtags/MM-double.sam

+3
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,3 @@
1+
@CO Modifications called on both strands of the same record,
2+
@CO including potentially at the same location simultaneously.
3+
* 0 * 0 0 * * 0 0 AGGATCTCTAGCGGATCGGCGGGGGATATGCCATAT * Mm:Z:C+m,1,3,0;G-m,0,2,0,4;G+o,4; Ml:B:C,128,153,179,115,141,166,192,102

test/SAMtags/MM-double.txt

+36
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,36 @@
1+
A T
2+
G Cm45
3+
G C
4+
A T
5+
T A
6+
C G
7+
T A
8+
Cm50 G
9+
T A
10+
A T
11+
G C
12+
C G
13+
G Cm55
14+
Go40 Cm65
15+
A T
16+
T A
17+
C G
18+
G C
19+
G C
20+
C G
21+
G C
22+
G C
23+
G Cm75
24+
G C
25+
G C
26+
A T
27+
T A
28+
A T
29+
T A
30+
G C
31+
Cm59 G
32+
Cm70 G
33+
A T
34+
T A
35+
A T
36+
T A

test/SAMtags/MM-multi.sam

+7
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,7 @@
1+
@CO Testing multiple m, h and N modifications on the same read.
2+
@CO r1 has them separated out.
3+
@CO r2 has them combined together, for example as produced by
4+
@CO a joint basecaller which assigns probabilities to all
5+
@CO trained events simultaneously.
6+
r1 0 * 0 0 * * 0 0 AGCTCTCCAGAGTCGNACGCCATYCGCGCGCCACCA * Mm:Z:C+m,2,2,1,4,1;C+h,6,7;N+n,15,2; Ml:B:C,128,153,179,204,230,159,6,215,240
7+
r2 0 * 0 0 * * 0 0 AGCTCTCCAGAGTCGNACGCCATYCGCGCGCCACCA * Mm:Z:C+mh,2,2,0,0,4,1;N+n,15; Ml:B:C,77,159,103,133,128,108,154,82,179,57,204,31,240

test/SAMtags/MM-multi.txt

+73
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,73 @@
1+
A T
2+
G C
3+
C G
4+
T A
5+
C G
6+
T A
7+
Cm50 G
8+
C G
9+
A T
10+
G C
11+
A T
12+
G C
13+
T A
14+
C G
15+
G C
16+
Nn84 N
17+
A T
18+
Cm59 G
19+
Gn93 C
20+
Ch62 G
21+
Cm70 G
22+
A T
23+
T A
24+
Y R
25+
C G
26+
G C
27+
C G
28+
G C
29+
C G
30+
G C
31+
C G
32+
Cm79 G
33+
A T
34+
C G
35+
Cm90h2 G
36+
A T
37+
38+
A T
39+
G C
40+
C G
41+
T A
42+
C G
43+
T A
44+
Cm30h62 G
45+
C G
46+
A T
47+
G C
48+
A T
49+
G C
50+
T A
51+
C G
52+
G C
53+
Nn93 N
54+
A T
55+
Cm40h52 G
56+
G C
57+
Cm50h42 G
58+
Cm60h32 G
59+
A T
60+
T A
61+
Y R
62+
C G
63+
G C
64+
C G
65+
G C
66+
C G
67+
G C
68+
C G
69+
Cm70h22 G
70+
A T
71+
C G
72+
Cm79h12 G
73+
A T

test/SAMtags/MM-orient.sam

+6
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,6 @@
1+
@CO Testing mods on top and bottom strand, but also in
2+
@CO original vs reverse-complemented orientation
3+
top-fwd 0 * 0 0 * * 0 0 AGGATCTCTAGCGGATCGGCGGGGGATATGCCATAT * Mm:Z:C+m,1,3,0; Ml:B:C,128,153,179
4+
top-rev 16 * 0 0 * * 0 0 ATATGGCATATCCCCCGCCGATCCGCTAGAGATCCT * Mm:Z:C+m,1,3,0; Ml:B:C,128,153,179
5+
bot-fwd 0 * 0 0 * * 0 0 AGGATCTCTAGCGGATCGGCGGGGGATATGCCATAT * Mm:Z:G-m,0,0,4,3; Ml:B:C,115,141,166,192
6+
bot-rev 16 * 0 0 * * 0 0 ATATGGCATATCCCCCGCCGATCCGCTAGAGATCCT * Mm:Z:G-m,0,0,4,3; Ml:B:C,115,141,166,192

0 commit comments

Comments
 (0)