point, the learned merge rules would then be applied to new words (as long as those new words do not include symbols that were not in the base vocabulary). For instance, the word "bug" would be tokenized to ["b", "ug"] but "mug" would be tokenized as ["<unk>", "ug"] since the symbol "m" is not in the base vocabulary. In general, single letters such as "m" are not replaced by the "<unk>" symbol because the training data usually includes at least one occurrence of each letter, but it is likely to happen for very special characters like emojis.
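As an illustration of that rule, here is a minimal sketch of applying learned merges to a new word; the base vocabulary and merge list below are made up for the example, not taken from a real tokenizer:

```python
# Toy sketch: apply learned BPE merges to a new word.
# base_vocab and merges are hypothetical, just enough to reproduce
# the "bug" / "mug" example from the passage above.
base_vocab = {"b", "u", "g", "h", "n", "ug", "un", "hug"}
merges = [("u", "g"), ("u", "n"), ("h", "ug")]  # in the order they were learned

def bpe_tokenize(word):
    # Symbols missing from the base vocabulary become "<unk>".
    tokens = [c if c in base_vocab else "<unk>" for c in word]
    # Apply each merge rule greedily, in learned order.
    for left, right in merges:
        i = 0
        while i < len(tokens) - 1:
            if tokens[i] == left and tokens[i + 1] == right:
                tokens[i:i + 2] = [left + right]
            else:
                i += 1
    return tokens

print(bpe_tokenize("bug"))  # ['b', 'ug']
print(bpe_tokenize("mug"))  # ['<unk>', 'ug'] -- "m" is not in the base vocabulary
```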
I read this in the docs, so in theory an emoji would be replaced with <unk> — but is that true for all tokenizers?
Not for all of them. You can check it yourself: encode some text into tokens and then decode it back.
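A quick way to run that check with the transformers library (the model names here are just common examples; substitute whatever tokenizer you actually use):

```python
# Round-trip check: encode text containing an emoji, decode it back,
# and see whether the tokenizer had to fall back to its unknown token.
from transformers import AutoTokenizer

text = "nice 🙂"

for name in ["bert-base-uncased", "gpt2"]:
    tok = AutoTokenizer.from_pretrained(name)
    ids = tok.encode(text, add_special_tokens=False)
    print(name)
    print("  tokens :", tok.convert_ids_to_tokens(ids))
    print("  decoded:", tok.decode(ids))

# bert-base-uncased uses WordPiece over a fixed character set, so the emoji
# comes back as [UNK]; gpt2 uses byte-level BPE, so the round trip is lossless
# and no unknown token is ever needed.
```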