point, the learned merge rules would then be applied to new words (as long as those new words do not include symbols that were not in the base vocabulary). For instance, the word "bug" would be tokenized to ["b", "ug"] but "mug" would be tokenized as ["<unk>", "ug"] since the symbol "m" is not in the base vocabulary. In general, single letters such as "m" are not replaced by the "<unk>" symbol because the training data usually includes at least one occurrence of each letter, but it is likely to happen for very special characters like emojis.
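As an illustration of that rule, here is a minimal sketch of applying learned merges to a new word; the base vocabulary and merge list below are made up for the example, not taken from a real tokenizer:

```python
# Toy sketch: apply learned BPE merges to a new word.
# base_vocab and merges are hypothetical, just enough to reproduce
# the "bug" / "mug" example from the passage above.
base_vocab = {"b", "u", "g", "h", "n", "ug", "un", "hug"}
merges = [("u", "g"), ("u", "n"), ("h", "ug")]  # in the order they were learned

def bpe_tokenize(word):
    # Symbols missing from the base vocabulary become "<unk>".
    tokens = [c if c in base_vocab else "<unk>" for c in word]
    # Apply each merge rule greedily, in learned order.
    for left, right in merges:
        i = 0
        while i < len(tokens) - 1:
            if tokens[i] == left and tokens[i + 1] == right:
                tokens[i:i + 2] = [left + right]
            else:
                i += 1
    return tokens

print(bpe_tokenize("bug"))  # ['b', 'ug']
print(bpe_tokenize("mug"))  # ['<unk>', 'ug'] -- "m" is not in the base vocabulary
```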
I read this in the docs, so in theory an emoji would be replaced with <unk> — but is that true for all tokenizers?
Not for all of them. You can check it yourself: encode some text into tokens and then decode it back.
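A quick way to run that check with the transformers library (the model names here are just common examples; substitute whatever tokenizer you actually use):

```python
# Round-trip check: encode text containing an emoji, decode it back,
# and see whether the tokenizer had to fall back to its unknown token.
from transformers import AutoTokenizer

text = "nice 🙂"

for name in ["bert-base-uncased", "gpt2"]:
    tok = AutoTokenizer.from_pretrained(name)
    ids = tok.encode(text, add_special_tokens=False)
    print(name)
    print("  tokens :", tok.convert_ids_to_tokens(ids))
    print("  decoded:", tok.decode(ids))

# bert-base-uncased uses WordPiece over a fixed character set, so the emoji
# comes back as [UNK]; gpt2 uses byte-level BPE, so the round trip is lossless
# and no unknown token is ever needed.
```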