bert4torch.tokenizers module

Tokenization classes.

class bert4torch.tokenizers.BasicTokenizer(do_lower_case=True, never_split=('[UNK]', '[SEP]', '[PAD]', '[CLS]', '[MASK]'))[source]

Runs basic tokenization (punctuation splitting, lower casing, etc.).

tokenize(text)[source]

Splits text into tokens.
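
A minimal sketch of typical behaviour, assuming the punctuation-splitting and lower-casing rules of the original BERT BasicTokenizer; the exact output shown is an expectation, not a guarantee:

from bert4torch.tokenizers import BasicTokenizer

basic = BasicTokenizer(do_lower_case=True)
tokens = basic.tokenize('Hello, World!')
# expected to resemble ['hello', ',', 'world', '!'] (lower-cased, punctuation split off)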

class bert4torch.tokenizers.SpTokenizer(sp_model_path, remove_space=True, keep_accents=False, do_lower_case=False, **kwargs)[source]

Wrapper around a SentencePiece model; usage is essentially the same as Tokenizer.
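
A minimal usage sketch, assuming a trained SentencePiece model file at the illustrative path 'sp.model' and that SpTokenizer inherits the tokenize/tokens_to_ids helpers from TokenizerBase:

from bert4torch.tokenizers import SpTokenizer

sp_tokenizer = SpTokenizer('sp.model')        # 'sp.model' is an illustrative path
tokens = sp_tokenizer.tokenize('今天天气不错')   # SentencePiece subword pieces
ids = sp_tokenizer.tokens_to_ids(tokens)       # subwords -> ids
text = sp_tokenizer.decode(ids)                # back to readable text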

decode(ids)[source]

Converts ids back to readable text.

id_to_token(i)[source]

Converts an id to its corresponding token.

preprocess_text(inputs)[source]

Ported from tokenization_xlnet in the transformers package; the main difference is the handling of punctuation.

token_to_id(token)[source]

Converts a token to its corresponding id.

class bert4torch.tokenizers.Tokenizer(token_dict, do_lower_case=True, do_basic_tokenize=True, do_tokenize_unk=False, **kwargs)[source]

Native BERT tokenizer.
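
A minimal usage sketch, assuming token_dict is the path to a standard BERT vocab.txt; the (token_ids, segment_ids) return convention is carried over from the bert4keras-style API this tokenizer mirrors and is therefore an assumption:

from bert4torch.tokenizers import Tokenizer

tokenizer = Tokenizer('vocab.txt', do_lower_case=True)   # 'vocab.txt' is an illustrative path
tokens = tokenizer.tokenize('语言模型')                    # e.g. ['[CLS]', '语', '言', '模', '型', '[SEP]']
token_ids, segment_ids = tokenizer.encode('语言模型')      # assumed to return (token_ids, segment_ids)
text = tokenizer.decode(token_ids)                        # back to readable text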

decode(ids, tokens=None)[source]

Converts ids back to readable text.

id_to_token(id)[source]

Converts an id to the corresponding token in the vocabulary.

rematch(text, tokens)[source]

Gives the mapping between the original text and the tokens produced by tokenization.
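
A hedged sketch of what rematch provides, reusing the tokenizer from the sketch above; in the bert4keras-style API this mirrors, each entry of the returned mapping is assumed to hold the character indices of the original text covered by the corresponding token:

text = 'Playing games'
tokens = tokenizer.tokenize(text)
mapping = tokenizer.rematch(text, tokens)
# mapping[i] is assumed to list the character positions in `text` covered by tokens[i]
# (empty for special tokens such as [CLS] and [SEP])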

static stem(token)[source]

Returns the "stem" of a token (if it starts with ##, the ## prefix is stripped automatically).
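
For example (stem is a static method, so it can be called on the class directly):

Tokenizer.stem('##able')   # -> 'able'
Tokenizer.stem('un')       # -> 'un' (unchanged: no ## prefix)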

token_to_id(token)[source]

Converts a token to its id in the vocab.

class bert4torch.tokenizers.TokenizerBase(token_start='[CLS]', token_end='[SEP]', token_unk='[UNK]', token_pad='[PAD]', token_mask='[MASK]', add_special_tokens=None, pre_tokenize=None, token_translate=None)[source]

Base class for tokenizers.

decode(ids)[source]

Converts ids back to readable text.

encode(first_texts, second_texts=None, maxlen=None, pattern='S*E*E', truncate_from='right', return_offsets=False)[source]

Can handle a single example or multiple examples.
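
A sketch of single-sentence and sentence-pair encoding, reusing the Tokenizer instance from the earlier sketch; the default pattern 'S*E*E' is assumed to lay the inputs out as [CLS] first [SEP] second [SEP], and the (token_ids, segment_ids) return convention is again an assumption carried over from the bert4keras-style API:

token_ids, segment_ids = tokenizer.encode('句子一')                      # single example
token_ids, segment_ids = tokenizer.encode('句子一', '句子二', maxlen=64)  # sentence pair, truncated to 64 ids
# segment_ids are assumed to be 0 over the first sentence and 1 over the second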

id_to_token(i)[source]

Converts an id to its corresponding token.

ids_to_tokens(ids)[source]

Converts a sequence of ids to the corresponding sequence of tokens.

token_to_id(token)[source]

Converts a token to its corresponding id.

tokenize(text, maxlen=None)[source]

Tokenization function.

tokens_to_ids(tokens)[source]

Converts a sequence of tokens to the corresponding sequence of ids.
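
A small round-trip sketch tying the token/id conversions together (again reusing the earlier Tokenizer instance):

tokens = tokenizer.tokenize('深度学习')
ids = tokenizer.tokens_to_ids(tokens)
assert tokenizer.ids_to_tokens(ids) == tokens   # assumed to round-trip for in-vocabulary tokens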

class bert4torch.tokenizers.Trie[source]

Ported directly from tokenization_utils.py in the transformers package, mainly for tokenizing special_tokens.
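
Since this class is a direct port of the Trie in transformers' tokenization_utils.py, its interface is presumably that class's add/split pair; the calls below are therefore an assumption, not a documented API:

from bert4torch.tokenizers import Trie

trie = Trie()
trie.add('[MASK]')                       # register a special token
parts = trie.split('天气[MASK]不错')      # assumed to split the text around registered special tokens
# parts is expected to resemble ['天气', '[MASK]', '不错']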

class bert4torch.tokenizers.WordpieceTokenizer(vocab, unk_token='[UNK]', max_input_chars_per_word=100, do_tokenize_unk=False)[source]

Runs WordPiece tokenization.

tokenize(text)[source]

Tokenizes a piece of text into its word pieces.

This uses a greedy longest-match-first algorithm to perform tokenization using the given vocabulary.

For example:

input = "unaffable"
output = ["un", "##aff", "##able"]

Args:

text: A single token or whitespace-separated tokens. This should have already been passed through BasicTokenizer.

Returns:

A list of wordpiece tokens.
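
A runnable sketch of the example above, using an illustrative hand-built vocab dict:

from bert4torch.tokenizers import WordpieceTokenizer

vocab = {'[UNK]': 0, 'un': 1, '##aff': 2, '##able': 3}   # illustrative vocabulary
wp = WordpieceTokenizer(vocab=vocab, unk_token='[UNK]')
wp.tokenize('unaffable')   # -> ['un', '##aff', '##able'] via greedy longest-match-first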

bert4torch.tokenizers.convert_to_unicode(text)[source]

Converts text to Unicode (if it’s not already), assuming utf-8 input.

bert4torch.tokenizers.load_vocab(dict_path, encoding='utf-8', simplified=False, startswith=None)[source]

Loads a vocabulary file into a dict.
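
A hedged sketch; the return value is assumed to map each token to its index, following the bert4keras-style helper this mirrors:

from bert4torch.tokenizers import load_vocab

token_dict = load_vocab('vocab.txt')   # 'vocab.txt' is an illustrative path to a BERT vocab file
# token_dict is assumed to map token -> id, e.g. token_dict['[PAD]'] == 0 for standard BERT vocabs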

bert4torch.tokenizers.whitespace_tokenize(text)[source]

Strips whitespace from the text and splits it on whitespace.
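
A small example; the behaviour assumed here follows the original BERT helper of the same name:

from bert4torch.tokenizers import whitespace_tokenize

whitespace_tokenize('  hello   world ')   # expected to return ['hello', 'world']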