Feature Request: Speculative Decoding "acceptance rate" should not count drafts that were skipped via the " ignore small drafts" clause

### Prerequisites

- [x] I am running the latest code. Mention the version if possible as well.
- [x] I carefully followed the [README.md](https://github.com/ggml-org/llama.cpp/blob/master/README.md).
- [x] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- [x] I reviewed the [Discussions](https://github.com/ggml-org/llama.cpp/discussions), and have a new and useful enhancement to share.

### Feature Description

I think the `slot.n_draft_total += draft.size()` should go after the "ignore small drafts" test here:

```cpp
                llama_tokens draft = common_speculative_gen_draft(slot.spec, params_spec, cached_text_tokens, id);

                // keep track of total number of tokens generated in the draft
                slot.n_draft_total += draft.size();

                // ignore small drafts
                if (slot.params.speculative.n_min > (int) draft.size()) {
                    SLT_DBG(slot, "ignoring small draft: %d < %d\n", (int) draft.size(), slot.params.speculative.n_min);

                    continue;
                }

                // construct the speculation batch
                common_batch_clear(slot.batch_spec);
                common_batch_add  (slot.batch_spec, id, slot.n_past, { slot.id }, true);

                for (size_t i = 0; i < draft.size(); ++i) {
                    common_batch_add(slot.batch_spec, draft[i], slot.n_past + 1 + i, { slot.id }, true);
                }

                SLT_DBG(slot, "decoding speculative batch, size = %d\n", slot.batch_spec.n_tokens);

                llama_decode(ctx, slot.batch_spec);

                // the accepted tokens from the speculation
                const auto ids = common_sampler_sample_and_accept_n(slot.smpl, ctx, draft);

                slot.n_past    += ids.size();
                slot.n_decoded += ids.size();

                // update how many tokens out of draft was accepted
                slot.n_draft_accepted += ids.size() - 1;
```

IMO, the "acceptance rate" should be the fraction of tokens we ***passed through the larger model*** that were accepted.

We could add another `slot.n_draft_generated` that counts how many tokens were generated by the draft model to print the existing stat, but since the speculative-decoding code can reuse drafts that were skipped via the "ignore small drafts" clause; it probably has limited usefulness.

(Not sure what to put this under as it might be classed as a bug,or a feature request if adding `slot.n_draft_generated`)

### Motivation

IMO, the "acceptance rate" should be the fraction of tokens we ***passed through the larger model*** that were accepted.

### Possible Implementation

_No response_

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Feature Request: Speculative Decoding "acceptance rate" should not count drafts that were skipped via the " ignore small drafts" clause #14048

Prerequisites

Feature Description

Motivation

Possible Implementation

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Feature Request: Speculative Decoding "acceptance rate" should not count drafts that were skipped via the " ignore small drafts" clause #14048

Description

Prerequisites

Feature Description

Motivation

Possible Implementation

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions