bert4torch.tokenizers module
Tokenization classes.
- class bert4torch.tokenizers.BasicTokenizer(do_lower_case=True, never_split=('[UNK]', '[SEP]', '[PAD]', '[CLS]', '[MASK]'))
Runs basic tokenization (punctuation splitting, lower casing, etc.).
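For illustration, a minimal usage sketch; the sample text is made up, and it assumes the class keeps the original BERT BasicTokenizer's tokenize(text) method:

```python
from bert4torch.tokenizers import BasicTokenizer

# Illustrative only: assumes the BERT-style tokenize(text) method.
basic = BasicTokenizer(do_lower_case=True)
print(basic.tokenize('Hello, World!'))  # expected: ['hello', ',', 'world', '!']
```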
- class bert4torch.tokenizers.SpTokenizer(sp_model_path, remove_space=True, keep_accents=False, do_lower_case=False, **kwargs)
A wrapper around a SentencePiece model; its usage is essentially the same as Tokenizer.
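A minimal usage sketch; the model path and sample text are illustrative, and the encode interface is assumed to match Tokenizer's:

```python
from bert4torch.tokenizers import SpTokenizer

# './spiece.model' is a placeholder for any trained SentencePiece model file.
tokenizer = SpTokenizer('./spiece.model')
token_ids, segment_ids = tokenizer.encode('language models are useful')
```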
- class bert4torch.tokenizers.Tokenizer(token_dict, do_lower_case=True, do_basic_tokenize=True, do_tokenize_unk=False, **kwargs)
BERT's native tokenizer.
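A minimal usage sketch, assuming token_dict may be given as the path to a BERT-style vocab.txt (the path and text below are illustrative):

```python
from bert4torch.tokenizers import Tokenizer

# './vocab.txt' is a placeholder for a BERT vocabulary file.
tokenizer = Tokenizer('./vocab.txt', do_lower_case=True)
token_ids, segment_ids = tokenizer.encode('language models')  # ids framed by [CLS]/[SEP]
tokens = tokenizer.tokenize('language models')
```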
- class bert4torch.tokenizers.TokenizerBase(token_start='[CLS]', token_end='[SEP]', token_unk='[UNK]', token_pad='[PAD]', token_mask='[MASK]', add_special_tokens=None, pre_tokenize=None, token_translate=None)
Base class for tokenizers.
- class bert4torch.tokenizers.Trie
Ported directly from transformers' tokenization_utils.py, mainly used to split text on special_tokens.
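Assuming the ported class keeps the transformers Trie interface (add and split), a usage sketch:

```python
from bert4torch.tokenizers import Trie

trie = Trie()
trie.add('[CLS]')
trie.add('[SEP]')
# split cuts the text around the registered special tokens
print(trie.split('[CLS]a question[SEP]'))  # expected: ['[CLS]', 'a question', '[SEP]']
```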
- class bert4torch.tokenizers.WordpieceTokenizer(vocab, unk_token='[UNK]', max_input_chars_per_word=100, do_tokenize_unk=False)
Runs WordPiece tokenization.
- tokenize(text)
Tokenizes a piece of text into its word pieces.
This uses a greedy longest-match-first algorithm to perform tokenization using the given vocabulary.
- For example:
  input = "unaffable", output = ["un", "##aff", "##able"]
- Args:
  text: A single token or whitespace-separated tokens. This should have already been passed through BasicTokenizer.
- Returns:
A list of wordpiece tokens.
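To make the greedy longest-match-first procedure concrete, here is a standalone sketch of the same algorithm (not the class's actual code; vocab is any container supporting membership tests):

```python
def wordpiece_tokenize(word, vocab, unk_token='[UNK]', max_chars=100):
    """Greedy longest-match-first WordPiece, mirroring the documented behavior."""
    if len(word) > max_chars:
        return [unk_token]
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        cur = None
        while start < end:                # shrink the window until a vocab hit
            sub = word[start:end]
            if start > 0:
                sub = '##' + sub          # non-initial pieces carry the ## prefix
            if sub in vocab:
                cur = sub
                break
            end -= 1
        if cur is None:                   # no piece matched: the whole word is unknown
            return [unk_token]
        tokens.append(cur)
        start = end
    return tokens

vocab = {'un', '##aff', '##able'}
print(wordpiece_tokenize('unaffable', vocab))  # ['un', '##aff', '##able']
```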
- bert4torch.tokenizers.convert_to_unicode(text)
Converts text to Unicode (if it’s not already), assuming utf-8 input.
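As a sketch of the documented behavior (not necessarily the module's exact code):

```python
def convert_to_unicode(text):
    """Return text as str; decode bytes as UTF-8, pass str through unchanged."""
    if isinstance(text, bytes):
        return text.decode('utf-8', 'ignore')
    return text
```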