SAMtags.tex (+166)
@@ -471,6 +471,169 @@ \subsubsection{Color space}
471
471
Color read quality on the original strand of the read. Same encoding as {\sf QUAL}; same length as {\tt CS}.
472
472
\end{description}
473
473
474
+
\section{Draft tags}
475
+
476
+
These are tags which have been proposed and are broadly accepted to
477
+
become standard tags, but a review or probationary period has been
478
+
deemed useful. They use the locally-defined tag namespace and
479
+
processing software should consider that the tags may have local usage
480
+
for other purposes.
481
+
482
+
\begin{center}\small
483
+
% This table is sorted alphabetically
484
+
\begin{longtable}{ccp{12.5cm}}
485
+
\hline
486
+
{\bf Tag} & {\bf Type} & {\bf Description} \\
487
+
\hline
488
+
\endhead
489
+
{\tt Mm} & Z & Base modifications / methylation \\
490
+
% {\tt MP} & Z & Base modification qualities \\
491
+
{\tt Ml} & B,C & Base modification probabilities \\
492
+
\end{longtable}
493
+
\end{center}
494
+
495
+
\subsection{Base modifications}
496
+
497
+
Base modifications, including base methylation, are represented as a series of edits from the primary unmodified sequence as originally reported by the sequencing instrument.
498
+
This potentially differs from the sequence stored in the main SAM {\sf SEQ} field if the latter has been reverse complemented, in which case SAM {\sf FLAG} 0x10 must be set.
499
+
This means modification positions are also recorded against the original orientation (i.e. starting at the 5' end), and count the original base types.
500
+
501
+
Each modified base listed also has a quality value associated with it.
502
+
Given the unmodified base already has a phred likelihood, this base modification quality should be interpreted as the likelihood of this modification being correct given an assumption the original call is correct.
The first character is the unmodified ``fundamental'' base as reported
508
+
by the sequencing instrument for the top strand.
509
+
It must be one of {\tt A}, {\tt C}, {\tt G}, {\tt T}, {\tt U} (if RNA) or {\tt N} for anything else, including any IUPAC ambiguity codes in the reported SEQ field.
510
+
Note {\tt N} may be used to match any base rather than specifically an {\tt N} call by the sequencing instrument.
511
+
This may be used in situations where the base modification is not a derivation of a standard base type.
512
+
This is followed by either plus or minus indicating the strand the modification was observed on (relative to the original sequenced strand of {\sf SEQ} with plus meaning same orientation),\footnote{Hence a tool that may reverse complement sequences does not need to understand how to manipulate the {\tt Mm} and {\tt Ml} tags.} and one or more base modification codes.
513
+
This is then followed by a comma-separated list of counts, each giving how many bases of the stated base type to skip before the next modified base. Each count is a delta from the previous modified base, so 0 denotes the first (or immediately next) base of that type, and counting starts from the 5' end of the as-sequenced (uncomplemented) {\sf SEQ} field.
514
+
This number series is comparable to the numbers in an {\tt MD} tag,
515
+
albeit counting specific base types only and potentially reverse-complemented.
516
+
517
+
For example {\tt C+m,5,12,0;} tells us there are three
518
+
5-Methylcytosine bases on the top strand of {\sf SEQ}.
519
+
The first 5 {\tt C} bases are unmodified and the 6th is modified, as are the 19th (with 12 between the 6th and 19th) and 20th.
520
+
Similarly {\tt G-m,14;} indicates the 15th {\tt G} is a 5-Methylcytosine on the opposite strand (still counting using the top strand base calls from the 5' end).
521
+
When the alignment record is reverse complemented (SAM flag 0x10) these two examples do not change since the tag always refers to the as-sequenced orientation.
522
+
See the test/SAMtags/MM-orient.sam file for examples.
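The delta counting described above can be sketched in Python. This is a minimal illustration rather than a reference implementation: the function name {\tt mm\_positions}, its {\tt reverse} argument and the choice of 0-based output indices are assumptions made for the example, and only the tag syntax described in this section is handled.

```python
def mm_positions(mm_value, seq, reverse=False):
    """Decode an Mm value into 0-based indices on the as-sequenced read.

    mm_value is the tag value without the "Mm:Z:" prefix, e.g. "C+m,5,12,0;".
    reverse should be True when SAM FLAG 0x10 is set for the record.
    Names and return layout are hypothetical, chosen for this sketch.
    """
    complement = str.maketrans("ACGTUN", "TGCAAN")
    # Undo the reverse complement so counting starts at the original 5' end.
    original = seq.translate(complement)[::-1] if reverse else seq
    result = {}
    for entry in filter(None, mm_value.split(";")):
        head, *deltas = entry.split(",")
        base, strand, codes = head[0], head[1], head[2:]
        # N counts every base; otherwise only bases of the stated type count.
        sites = [i for i, b in enumerate(original) if base == "N" or b == base]
        found, cursor = [], 0
        for delta in map(int, deltas):
            cursor += delta  # skip this many candidate bases
            if cursor >= len(sites):
                raise ValueError("coordinate lies beyond the sequence")
            found.append(sites[cursor])
            cursor += 1  # continue counting after the modified base
        result[(base, strand, codes)] = found
    return result


# The 6th, 19th and 20th C of a read made of 20 "AC" repeats.
example = mm_positions("C+m,5,12,0;", "AC" * 20)
```

Because the positions are computed on the as-sequenced orientation, passing the reverse-complemented {\sf SEQ} with {\tt reverse=True} yields the same indices as passing the original read.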
523
+
524
+
This permits modifications to be listed on either strand with the rare potential for both strands to have a modification at the same site.
525
+
If SAM FLAG 0x10 is set, indicating that SEQ has been reverse complemented from the sequence observed by the sequencing machine, note that these base modification field values will be in the opposite orientation to SEQ and other derived SAM fields.
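A tool that needs to relate a modification back to the stored {\sf SEQ} could convert coordinates as sketched below. The helper name {\tt to\_stored\_seq\_index} and the use of 0-based indices are assumptions made for illustration only.

```python
def to_stored_seq_index(orig_index, seq_len, flag):
    """Map a 0-based index on the as-sequenced read to an index in stored SEQ.

    Hypothetical helper: when FLAG 0x10 is set the stored SEQ is the reverse
    complement of the read, so the index is mirrored; otherwise it is unchanged.
    """
    if not 0 <= orig_index < seq_len:
        raise ValueError("index lies beyond the sequence")
    if flag & 0x10:
        return seq_len - 1 - orig_index
    return orig_index


# A read of 20 "AC" repeats is stored as 20 "GT" repeats when FLAG 0x10 is set.
as_sequenced = "AC" * 20
stored = "GT" * 20
stored_index = to_stored_seq_index(11, len(stored), 0x10)
```

Note that the base found at the mirrored index in the stored {\sf SEQ} is the complement of the base originally called, so a modified {\tt C} on the read appears as a {\tt G} in the stored sequence.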
526
+
527
+
Note it is permitted for the coordinate list to be empty (for example {\tt Mm:Z:C+m;}), which may be used as an explicit indicator that this base modification is not present.
528
+
It is not permitted for coordinates to be beyond the length of the sequence.
529
+
530
+
When multiple modifications are listed, for example {\tt C+mh,5,12,0;}, this indicates that the modification at each listed position may be any of the stated types.
531
+
The associated confidence values in the {\tt Ml} tag may be used to determine the relative likelihoods between the options.
532
+
The example above is equivalent to {\tt C+m,5,12,0;C+h,5,12,0;}, although this will have a different ordering of confidence values in {\tt Ml}.
533
+
Note ChEBI codes cannot be used in the multi-modification form (such as the {\tt C+mh} example above).
534
+
535
+
If the modification is not one of the standard common types (listed below) it can be specified as a numeric ChEBI code.
536
+
For example {\tt C+76792,57;} is the same as {\tt C+h,57;}.
537
+
538
+
An unmodified base of {\tt N} means count any base in {\sf SEQ}, not only those of {\tt N}.
539
+
Thus {\tt N+n,100;} means the 101st base is Xanthosine (n), irrespective of the sequence composition.
540
+
541
+
The standard code types and their associated ChEBI values are listed
542
+
below, taken from Viner {\it et al.}%
543
+
\footnote{Coby Viner {\it et al.}, \emph{Modeling methyl-sensitive
544
+
transcription factor motifs with an expanded epigenetic alphabet}, \url{https://www.biorxiv.org/content/10.1101/043794v1}.}
% MP was the former quality score for MM. However being Phred scores
579
+
% it can only reasonably record probabilities for highly likely
580
+
% events, making it inappropriate for callers (eg ONT's) that wish to
581
+
% jointly call probabilities for the entire trained set of
582
+
% possibilities. We could use log-odds, similar to how early Illumina
583
+
% runs did to record likelihoods for A, C, G and T irrespective of
584
+
% call, but for now we're using linear-scaled probabilities. These
585
+
% are in the ML tag.
586
+
%
587
+
% The MP tag is left here for now as the jury is still out on whether
588
+
% we'll need it in the future.
589
+
%
590
+
% \item[MP:Z:\tagvalue{qualities}]
591
+
% \hfill\\
592
+
% The optional {\tt MP} tag lists the Phred qualities of each modification listed in the {\tt MM} tag in the order they occur.
593
+
% The qualities are encoded in the same manner as the primary {\sf QUAL} field; one byte per quality with ASCII value Phred score + 33.
594
+
% A space character (`{\tt \textvisiblespace}') should be used as a separator between concatenated quality strings when multiple modification lists are present in the {\tt MM} tag.
595
+
% The length should match the number of position deltas from {\tt MM} plus 1 per space character required.
596
+
%
597
+
% For example ``{\tt MM:Z:C+m,5,12,3;C+h,57;}'' may have an associated
598
+
% quality tag of ``{\tt MP:Z:5EB /}''.
599
+
%
600
+
% Where multiple modification types are listed together, such as in ``{\tt MM:Z:C+mh,5,12,3;}'' the quality values are interleaved in order ({\tt m} at 6, {\tt h} at 6, {\tt m} at 19, {\tt h} at 19 and so on), giving 6 quality values in total for this example.
601
+
%
602
+
% Quality values for ambiguity codes give the likelihood that the
603
+
% modification is one of the possible codes compatible with that
604
+
% ambiguity code. For example {\tt MM:Z:C+C,10; MP:Z:+} indicates a C
605
+
% call with an unspecified modification and the phred score of 10 (ASCII
606
+
% value {\tt +}). This corresponds to a 90\% chance of the base being
607
+
% modified.
608
+
%
609
+
% To represent several possible modifications at the same site the {\tt MP} tag can be used to indicate the probabilities of each possibility.
610
+
% The values used should be absolute probabilities, not relative between the alternatives.
611
+
% For example, a C base that has 90\% chance of being modified with 5mC being three times more likely than 5hmC will encode 5mC with 67.5\% probability ($0.9 * 0.75$ giving phred score 5, ASCII value {\tt \&}) and 5hmC with 22.5\% probability ($0.9 * 0.25$ giving phred score 1, ASCII value {\tt "}).
612
+
% This could be represented with ``{\tt MM:Z:C+m,10;C+h,10; MP:Z:\& "}''.
613
+
614
+
\item[Ml:B:C,\tagvalue{scaled-probabilities}]
615
+
\hfill\\
616
+
The optional {\tt Ml} tag lists the probability of each modification listed in the {\tt Mm} tag being correct, in the order that they occur.
617
+
The continuous probability range 0.0 to 1.0 is remapped in equal
618
+
sized portions to the discrete integers 0 to 255 inclusive. Thus the
619
+
probability range corresponding to integer value $N$ is $N/256$ to
620
+
$(N+1)/256$.
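The mapping between probabilities and integer values can be sketched as follows. The function names are hypothetical, and clamping a probability of exactly 1.0 to 255 is an assumption made here so that the result stays within a byte.

```python
def ml_to_range(n):
    """Return the (lower, upper) probability bounds for an Ml integer value."""
    if not 0 <= n <= 255:
        raise ValueError("Ml values are unsigned bytes in the range 0 to 255")
    return n / 256, (n + 1) / 256


def prob_to_ml(p):
    """Quantise a probability in [0.0, 1.0] to an Ml integer value.

    Assumption for this sketch: 1.0 is clamped into the top bin (255).
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("probability must lie between 0.0 and 1.0")
    return min(int(p * 256), 255)


low, high = ml_to_range(204)
```

Here the value 204 covers the range 0.796875 to 0.80078125, which is why it is read as roughly 80\%.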
621
+
622
+
The SAM encoding therefore uses a byte array of type `{\tt C}' whose number of elements equals the total number of modifications listed as being present in the {\tt Mm} tag, with each code in a multi-modification entry having its own probability at every listed position.
623
+
624
+
For example ``{\tt Mm:Z:C+m,5,12;C+h,5,12;}'' may have an associated tag of ``{\tt Ml:B:C,204,89,26,130}''.
625
+
626
+
If the above is rewritten in the multiple-modification form, the probabilities are interleaved in the order presented, giving ``{\tt Mm:Z:C+mh,5,12; Ml:B:C,204,26,89,130}''.
627
+
Note where several possible modifications are presented at the same site, the {\tt Ml} values represent the absolute probabilities of the modification call being correct and not the relative likelihood between the alternatives.
628
+
These probabilities should not sum to more than 1.0 ($\approx256$ in integer encoding, allowing for some minor rounding errors), but may sum to a lower total with the remainder representing the probability that none of the listed modification types are present.
629
+
In the example used above, the 6th {\tt C} has 80\% chance of being {\tt 5mC}, 10\% chance of being {\tt 5hmC} and 10\% chance of being an unmodified {\tt C}.
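The ordering of {\tt Ml} values for both the separate and the multiple-modification forms can be illustrated with a short Python sketch. The function name {\tt split\_ml} and the dictionary it returns are assumptions made for this example; it takes the {\tt Mm} value and the list of {\tt Ml} integers and groups the probabilities by modification code.

```python
def split_ml(mm_value, ml_values):
    """Group Ml integers by (base, strand, code), following the Mm ordering.

    Hypothetical helper: for an entry such as "C+mh,5,12" the values are
    interleaved per position (m, h, m, h), so each code takes every k-th value.
    """
    grouped = {}
    pos = 0
    for entry in filter(None, mm_value.split(";")):
        head, *deltas = entry.split(",")
        base, strand, codes = head[0], head[1], head[2:]
        # A numeric ChEBI code is one modification; letters are one code each.
        code_list = [codes] if codes.isdigit() else list(codes)
        k = len(code_list)
        chunk = ml_values[pos:pos + len(deltas) * k]
        if len(chunk) != len(deltas) * k:
            raise ValueError("too few Ml values for the Mm tag")
        for j, code in enumerate(code_list):
            grouped[(base, strand, code)] = chunk[j::k]
        pos += len(chunk)
    if pos != len(ml_values):
        raise ValueError("too many Ml values for the Mm tag")
    return grouped


separate = split_ml("C+m,5,12;C+h,5,12;", [204, 89, 26, 130])
combined = split_ml("C+mh,5,12;", [204, 26, 89, 130])
```

Both calls yield the same grouping, with 204 and 89 assigned to {\tt m} and 26 and 130 assigned to {\tt h}. For the 6th {\tt C} the remainder $1 - (204 + 26)/256 \approx 0.10$ is the probability that neither modification is present.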
630
+
631
+
{\tt Ml} values for ambiguity codes give the probability that the modification is one of the possible codes compatible with that ambiguity code.
632
+
For example {\tt Mm:Z:C+C,10; Ml:B:C,229} indicates a C call with a probability of 90\% of having some form of unspecified modification.
633
+
634
+
\end{description}
635
+
636
+
474
637
\section{Locally-defined tags}
475
638
476
639
You can freely add new tags.
@@ -492,6 +655,9 @@ \section{Tag History}
492
655
\setlength{\parindent}{0pt}
493
656
\newcommand*{\gap}{\vspace*{2ex}}
494
657
658
+
\subsubsection*{July 2021}
659
+
Added the Mm and Ml draft tags describing base modifications.
660
+
495
661
\subsubsection*{March 2020}
496
662
497
663
Transcript strand tag TS added, equivalent to the locally-defined XS tag