# Decoding features

This page describes CTranslate2 decoding features. The text translation API is used for demonstration but most features are also available for text generation.

The examples use the following symbols that are left unspecified:

* `translator`: a [`ctranslate2.Translator`](python/ctranslate2.Translator.rst) instance
* `tokenize`: a function taking a string and returning a list of strings
* `detokenize`: a function taking a list of strings and returning a string

This `input` sentence will be used as an example:

> This project is geared towards efficient serving of standard translation models but is also a place for experimentation around model compression and inference acceleration.

## Greedy search

Greedy search is the most basic and fastest decoding strategy. It simply takes the token that has the highest probability at each timestep.

```python
results = translator.translate_batch([tokenize(input)], beam_size=1)
print(detokenize(results[0].hypotheses[0]))
```

> Dieses Projekt ist auf die effiziente Bedienung von Standard-Übersetzungsmodellen ausgerichtet, aber auch ein Ort für Experimente rund um Modellkompression und Inferenzbeschleunigung.
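
To make the selection rule concrete, here is a minimal sketch of greedy search that does not depend on CTranslate2. `TOY_MODEL` and `greedy_decode` are hypothetical names introduced for illustration only. The toy model conditions on the last token alone, whereas a real model conditions on the source and on the full target prefix.

```python
from typing import Dict, List

BOS, EOS = "<s>", "</s>"

# Hypothetical toy "model": maps the last token to a distribution over the
# next token. This is an illustration, not part of the CTranslate2 API.
TOY_MODEL: Dict[str, Dict[str, float]] = {
    BOS: {"Dieses": 0.6, "Das": 0.4},
    "Dieses": {"Projekt": 0.9, "Programm": 0.1},
    "Das": {"Projekt": 0.7, "Programm": 0.3},
    "Projekt": {EOS: 0.8, "ist": 0.2},
    "Programm": {EOS: 1.0},
    "ist": {EOS: 1.0},
}


def greedy_decode(max_length: int = 10) -> List[str]:
    tokens: List[str] = []
    last = BOS
    for _ in range(max_length):
        probs = TOY_MODEL[last]
        # Keep only the single most probable token at this timestep.
        last = max(probs, key=probs.get)
        if last == EOS:
            break
        tokens.append(last)
    return tokens


print(greedy_decode())
```

Because only one candidate survives each step, the cost per step is a single model call, which is why this strategy is the fastest.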

## Beam search

Beam search is a common decoding strategy for sequence models. The algorithm keeps the N best hypotheses at each timestep, where N is set by `beam_size`. This increases decoding time and memory usage but can find a better final hypothesis than greedy search.

```python
results = translator.translate_batch([tokenize(input)], beam_size=4)
print(detokenize(results[0].hypotheses[0]))
```

> Dieses Projekt ist auf die effiziente Bedienung von Standard-Übersetzungsmodellen ausgerichtet, ist aber auch ein Ort für Experimente rund um Modellkompression und Inferenzbeschleunigung.

```{tip}
More hypotheses can be returned by setting the `num_hypotheses` argument.
```
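
The following sketch shows, on a toy example, why keeping several hypotheses can help. It is a simplified beam search written for illustration, not the CTranslate2 implementation: `TOY_MODEL` and `beam_search` are hypothetical names, and the sketch omits features such as length normalization.

```python
import math
from typing import Dict, List, Tuple

BOS, EOS = "<s>", "</s>"

# Hypothetical toy model: the locally best first token ("Das") leads to a
# worse complete sequence than the second-best one ("Dieses").
TOY_MODEL: Dict[str, Dict[str, float]] = {
    BOS: {"Das": 0.6, "Dieses": 0.4},
    "Das": {"Projekt": 0.55, "Programm": 0.45},
    "Dieses": {"Projekt": 0.9, "Programm": 0.1},
    "Projekt": {EOS: 1.0},
    "Programm": {EOS: 1.0},
}

Hypothesis = Tuple[List[str], float]  # (tokens, cumulative log probability)


def beam_search(beam_size: int, max_length: int = 10) -> List[Hypothesis]:
    beams: List[Hypothesis] = [([], 0.0)]
    finished: List[Hypothesis] = []
    for _ in range(max_length):
        # Expand every live hypothesis with every possible next token.
        candidates: List[Hypothesis] = []
        for tokens, score in beams:
            last = tokens[-1] if tokens else BOS
            for token, prob in TOY_MODEL[last].items():
                candidates.append((tokens + [token], score + math.log(prob)))
        # Keep only the beam_size best candidates.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_size]:
            if tokens[-1] == EOS:
                finished.append((tokens[:-1], score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    finished.sort(key=lambda c: c[1], reverse=True)
    return finished


print(beam_search(beam_size=1)[0])  # equivalent to greedy search
print(beam_search(beam_size=2)[0])  # finds a higher-scoring sequence
```

With `beam_size=1` the sketch commits to `Das` and cannot recover, while `beam_size=2` keeps `Dieses` alive long enough to discover that it leads to a more probable sequence.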

## Length constraints

The arguments `min_decoding_length` and `max_decoding_length` control the minimum and maximum number of tokens generated by the decoder. The length does not include the end-of-sequence token:

```python
results = translator.translate_batch([tokenize(input)], max_decoding_length=10)
assert len(results[0].hypotheses[0]) == 10
```

These length constraints do **not** apply to empty inputs. Empty inputs are not forwarded to the model and always return an empty output. This is why `min_decoding_length` defaults to 1: non-empty inputs are expected to generate at least one token.

```python
results = translator.translate_batch([[]], min_decoding_length=1)
assert len(results[0].hypotheses[0]) == 0
```

```{attention}
By default, the input is truncated after 1024 tokens to limit the maximum memory usage of the model. See the option `max_input_length`.
```
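
One common way to enforce such constraints is to forbid the end-of-sequence token until the minimum length is reached and to stop unconditionally at the maximum length. The sketch below illustrates this on a toy greedy decoder. `TOY_MODEL` and `constrained_greedy` are hypothetical, and the sketch is an assumption about the general technique rather than a description of CTranslate2 internals.

```python
from typing import Dict, List

BOS, EOS = "<s>", "</s>"

# Hypothetical toy model that prefers to stop early.
TOY_MODEL: Dict[str, Dict[str, float]] = {
    BOS: {"Dieses": 1.0},
    "Dieses": {EOS: 0.6, "Projekt": 0.4},
    "Projekt": {EOS: 0.7, "ist": 0.3},
    "ist": {EOS: 0.8, "gut": 0.2},
    "gut": {EOS: 1.0},
}


def constrained_greedy(
    min_decoding_length: int = 1, max_decoding_length: int = 10
) -> List[str]:
    tokens: List[str] = []
    last = BOS
    # Stop unconditionally once the maximum length is reached.
    while len(tokens) < max_decoding_length:
        probs = dict(TOY_MODEL[last])
        if len(tokens) < min_decoding_length:
            # The end-of-sequence token is not allowed yet.
            probs.pop(EOS, None)
        if not probs:
            break  # toy model has no other continuation
        last = max(probs, key=probs.get)
        if last == EOS:
            break
        tokens.append(last)
    return tokens


print(constrained_greedy())
print(constrained_greedy(min_decoding_length=3))
print(constrained_greedy(min_decoding_length=3, max_decoding_length=2))
```

As in the text above, the returned lengths never count the end-of-sequence token.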

## Autocompletion

The `target_prefix` argument can be used to force the start of the translation. Let's say we want to replace the first occurrence of `die` with `das` in the translation:

```python
results = translator.translate_batch(
    [tokenize(input)],
    target_prefix=[tokenize("Dieses Projekt ist auf das")],
)

print(detokenize(results[0].hypotheses[0]))
```

The prefix effectively changes the target context and the rest of the translation:

> Dieses Projekt ist auf das effiziente **Servieren** von Standard-Übersetzungsmodellen ausgerichtet, ist aber auch ein Ort für Experimente rund um Modellkompression und Inferenzbeschleunigung.
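
The effect of a forced prefix can be reproduced on a toy decoder: the prefix tokens are copied to the output verbatim, and decoding resumes from the new context. `TOY_MODEL` and `decode_with_prefix` are hypothetical names used only for this sketch.

```python
from typing import Dict, List

BOS, EOS = "<s>", "</s>"

# Hypothetical toy model: the continuation depends on the article chosen.
TOY_MODEL: Dict[str, Dict[str, float]] = {
    BOS: {"auf": 1.0},
    "auf": {"die": 0.7, "das": 0.3},
    "die": {"Bedienung": 0.9, EOS: 0.1},
    "das": {"Servieren": 0.8, EOS: 0.2},
    "Bedienung": {EOS: 1.0},
    "Servieren": {EOS: 1.0},
}


def decode_with_prefix(target_prefix: List[str], max_length: int = 10) -> List[str]:
    # Forced tokens are kept as-is; the model only chooses what follows.
    tokens = list(target_prefix)
    last = tokens[-1] if tokens else BOS
    while len(tokens) < max_length:
        probs = TOY_MODEL[last]
        last = max(probs, key=probs.get)
        if last == EOS:
            break
        tokens.append(last)
    return tokens


print(decode_with_prefix([]))
print(decode_with_prefix(["auf", "das"]))
```

Forcing `das` changes the word that follows it, which mirrors the change from `Bedienung` to `Servieren` in the translation above.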

## Biased decoding

Instead of using {ref}`decoding:autocompletion` to force a translation to start with a `target_prefix` argument, we can "bias" a translation towards a prefix by setting `prefix_bias_beta` to a value in (0, 1). The higher `prefix_bias_beta` is, the stronger the bias. A translation can diverge from the prefix when `prefix_bias_beta` is low and the translator is confident in decoding tokens that differ from the prefix tokens. See [section 4.2](https://arxiv.org/abs/1912.03393) for more details on the biasing algorithm.

```python
results = translator.translate_batch(
    [tokenize(input)],
    target_prefix=[tokenize("Dieses Projekt ist auf das")],
    prefix_bias_beta=0.5,
    beam_size=4,
)

print(detokenize(results[0].hypotheses[0]))
```

Setting `prefix_bias_beta=0.5` effectively enforces the `target_prefix` and changes the rest of the translation:

> Dieses Projekt ist auf das effiziente Servieren von Standard-Übersetzungsmodellen ausgerichtet, ist aber auch ein Ort für Experimente rund um Modellkompression und Inferenzbeschleunigung.

```python
results = translator.translate_batch(
    [tokenize(input)],
    target_prefix=[tokenize("Dieses Projekt ist auf das")],
    prefix_bias_beta=0.1,
    beam_size=4,
)

print(detokenize(results[0].hypotheses[0]))
```

Lowering the bias by setting `prefix_bias_beta=0.1` lets the translation diverge from the prefix, with `das` replaced by `die`:

> Dieses Projekt ist auf **die** effiziente Bedienung von Standard-Übersetzungsmodellen ausgerichtet, ist aber auch ein Ort für Experimente rund um Modellkompression und Inferenzbeschleunigung.
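
The sketch below illustrates one reading of the biasing idea from the referenced paper: at each prefix position the model distribution `p` is interpolated with a one-hot distribution on the prefix token, `(1 - beta) * p + beta * one_hot`, and the bias stops once the hypothesis has diverged from the prefix. This is an assumption made for illustration. It uses greedy search on a hypothetical `TOY_MODEL` and may differ from the CTranslate2 implementation in its details.

```python
from typing import Dict, List

BOS, EOS = "<s>", "</s>"

# Hypothetical toy model that prefers "die" over "das".
TOY_MODEL: Dict[str, Dict[str, float]] = {
    BOS: {"auf": 1.0},
    "auf": {"die": 0.7, "das": 0.3},
    "die": {"Bedienung": 0.9, EOS: 0.1},
    "das": {"Servieren": 0.8, EOS: 0.2},
    "Bedienung": {EOS: 1.0},
    "Servieren": {EOS: 1.0},
}


def biased_greedy(
    target_prefix: List[str], prefix_bias_beta: float, max_length: int = 10
) -> List[str]:
    tokens: List[str] = []
    last = BOS
    following_prefix = True
    while len(tokens) < max_length:
        step = len(tokens)
        probs = dict(TOY_MODEL[last])
        in_prefix = step < len(target_prefix)
        if following_prefix and in_prefix:
            wanted = target_prefix[step]
            # Interpolate the model distribution with a one-hot on the prefix token.
            probs = {t: (1 - prefix_bias_beta) * p for t, p in probs.items()}
            probs[wanted] = probs.get(wanted, 0.0) + prefix_bias_beta
        last = max(probs, key=probs.get)
        if last == EOS:
            break
        if in_prefix and last != target_prefix[step]:
            following_prefix = False  # diverged: stop biasing
        tokens.append(last)
    return tokens


print(biased_greedy(["auf", "das"], prefix_bias_beta=0.5))
print(biased_greedy(["auf", "das"], prefix_bias_beta=0.1))
```

With `prefix_bias_beta=0.5` the biased score of `das` (0.65) beats `die` (0.35), so the prefix is followed. With `prefix_bias_beta=0.1` the model's preference wins (0.63 against 0.37) and the output diverges to `die`.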

## Alternatives at a position

Combining `target_prefix` with the `return_alternatives` flag returns alternative sequences just after the prefix:

```python
results = translator.translate_batch(
    [tokenize(input)],
    target_prefix=[tokenize("Dieses Projekt ist auf die")],
    num_hypotheses=5,
    return_alternatives=True,
)

for hypothesis in results[0].hypotheses:
    print(detokenize(hypothesis))
```

> Dieses Projekt ist auf die **effiziente** Bedienung von Standard-Übersetzungsmodellen ausgerichtet, ist aber auch ein Ort für Experimente rund um Modellkompression und Inferenzbeschleunigung.
>
> Dieses Projekt ist auf die **effektive** Bedienung von Standard-Übersetzungsmodellen ausgerichtet, ist aber auch ein Ort für Experimente rund um Modellkompression und Inferenzbeschleunigung.
>
> Dieses Projekt ist auf die **effizientere** Bedienung von Standard-Übersetzungsmodellen ausgerichtet, ist aber auch ein Ort für Experimente rund um Modellkompression und Inferenzbeschleunigung.
>
> Dieses Projekt ist auf die **effizienten** Dienste von Standard-Übersetzungsmodellen ausgerichtet, aber auch ein Ort für Experimente rund um Modellkompression und Inferenzbeschleunigung.
>
> Dieses Projekt ist auf die **Effizienz** des Servierens von Standard-Übersetzungsmodellen ausgerichtet, ist aber auch ein Ort für Experimente rund um Modellkompression und Inferenzbeschleunigung.

In practice, the decoding extracts the `num_hypotheses` tokens that are most likely to appear after the target prefix. These tokens are then included in the prefix and the decoding completes each hypothesis independently.

```{tip}
The parameter `min_alternative_expansion_prob` can be used to filter out alternatives that are very unlikely. The expansion probability corresponds to the probability of the tokens that immediately follow the prefix. Try setting a small value like `min_alternative_expansion_prob=0.001` to filter out the most nonsensical alternatives.
```
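
The two-stage procedure described above (pick the most likely tokens after the prefix, then complete each hypothesis independently) can be sketched on a toy model. `TOY_MODEL`, `greedy_complete`, and `alternatives` are hypothetical names. The parameter names mirror the CTranslate2 arguments only to ease comparison, and each completion uses greedy search for simplicity.

```python
from typing import Dict, List

BOS, EOS = "<s>", "</s>"

# Hypothetical toy model with one very unlikely expansion after "die".
TOY_MODEL: Dict[str, Dict[str, float]] = {
    BOS: {"auf": 1.0},
    "auf": {"die": 1.0},
    "die": {"effiziente": 0.6, "effektive": 0.3, "effizientere": 0.0995, "Unsinn": 0.0005},
    "effiziente": {"Bedienung": 1.0},
    "effektive": {"Bedienung": 1.0},
    "effizientere": {"Bedienung": 1.0},
    "Unsinn": {EOS: 1.0},
    "Bedienung": {EOS: 1.0},
}


def greedy_complete(tokens: List[str], max_length: int = 10) -> List[str]:
    tokens = list(tokens)
    last = tokens[-1] if tokens else BOS
    while len(tokens) < max_length:
        probs = TOY_MODEL[last]
        last = max(probs, key=probs.get)
        if last == EOS:
            break
        tokens.append(last)
    return tokens


def alternatives(
    target_prefix: List[str],
    num_hypotheses: int,
    min_alternative_expansion_prob: float = 0.0,
) -> List[List[str]]:
    last = target_prefix[-1] if target_prefix else BOS
    # Stage 1: rank the tokens that can immediately follow the prefix.
    ranked = sorted(TOY_MODEL[last].items(), key=lambda kv: kv[1], reverse=True)
    expansions = [
        token
        for token, prob in ranked[:num_hypotheses]
        if token != EOS and prob >= min_alternative_expansion_prob
    ]
    # Stage 2: extend the prefix with each token and complete independently.
    return [greedy_complete(list(target_prefix) + [token]) for token in expansions]


for hypothesis in alternatives(["auf", "die"], num_hypotheses=4):
    print(" ".join(hypothesis))
```

Calling `alternatives(["auf", "die"], num_hypotheses=4, min_alternative_expansion_prob=0.001)` drops the `Unsinn` expansion, since its probability of 0.0005 falls below the threshold.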

## Random sampling

This decoding mode randomly samples tokens from the model output distribution. This strategy is frequently used in back-translation techniques ([Edunov et al. 2018](https://www.aclweb.org/anthology/D18-1045/)). The example below restricts the sampling to the best 10 candidates at each timestep and returns 3 random hypotheses:

```python
results = translator.translate_batch(
    [tokenize(input)],
    beam_size=1,
    sampling_topk=10,
    num_hypotheses=3,
)

for hypothesis in results[0].hypotheses:
    print(detokenize(hypothesis))
```

> Dieses Programm ist auf eine effiziente Bedienung von Standard-Übersetzungsmodellen ausgerichtet und ermöglicht gleichzeitig einen Einsatzort für Experimente rund um die Modellkompression oder das Beschleunigen der Schlussfolgerung.
>
> Es dient dazu, die standardisierten Übersetzungsmodelle effizient zu bedienen, aber auch zur Erprobung um die Formkomprimierung und die Folgebeschleunigung.
>
> Das Projekt richtet sich zwar auf den effizienten Service von Standard-Übersetzungen-Modellen, ist aber auch ein Ort für Experimente rund um Modellkomprimierung und ineffektive Beschleunigung.

```{tip}
You can increase the randomness of the generation by increasing the value of the argument `sampling_temperature`.
```
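
As a rough illustration of how top-k restriction and temperature interact, the sketch below samples one token from a hypothetical set of `LOGITS`. `topk_distribution` and `sample_token` are names made up for this example and are not CTranslate2 functions. The logits are divided by the temperature before the softmax, which is the usual definition and is assumed here.

```python
import math
import random
from typing import Dict

# Hypothetical next-token logits at one timestep.
LOGITS: Dict[str, float] = {"Projekt": 2.0, "Programm": 1.0, "Vorhaben": 0.5, "Unsinn": -3.0}


def topk_distribution(
    logits: Dict[str, float], topk: int, temperature: float = 1.0
) -> Dict[str, float]:
    # Keep the topk best candidates, then apply a temperature-scaled softmax.
    best = sorted(logits.items(), key=lambda kv: kv[1], reverse=True)[:topk]
    scaled = [(token, logit / temperature) for token, logit in best]
    peak = max(value for _, value in scaled)  # subtract the max for stability
    weights = {token: math.exp(value - peak) for token, value in scaled}
    total = sum(weights.values())
    return {token: weight / total for token, weight in weights.items()}


def sample_token(
    logits: Dict[str, float], topk: int, temperature: float, rng: random.Random
) -> str:
    dist = topk_distribution(logits, topk, temperature)
    return rng.choices(list(dist), weights=list(dist.values()), k=1)[0]


rng = random.Random(0)  # seeded for reproducibility
print([sample_token(LOGITS, topk=3, temperature=1.0, rng=rng) for _ in range(5)])
print(topk_distribution(LOGITS, topk=3, temperature=1.0))
print(topk_distribution(LOGITS, topk=3, temperature=2.0))
```

A higher temperature flattens the distribution over the retained candidates, so less likely tokens are drawn more often. With `topk=1` the sampling degenerates to greedy search.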