dianna.utils.tokenizers

Module Contents

Classes

Tokenizer

Abstract base class for tokenizing.

SpacyTokenizer

spaCy tokenizer for natural language.

class dianna.utils.tokenizers.Tokenizer(mask_token: str)[source]

Bases: abc.ABC

Abstract base class for tokenizing.

Exposes the same interface as (part of) the Tokenizer class from the Hugging Face transformers library.

abstract tokenize(sentence: str) → List[str][source]

Split a sentence into a list of tokens.

abstract convert_tokens_to_string(tokens: List[str]) → str[source]

Merge a list of tokens back into a sentence.
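
For illustration, a minimal concrete subclass could look as follows. This is only a sketch against the interface documented above; WhitespaceTokenizer is a hypothetical name and not part of dianna.

   from typing import List

   from dianna.utils.tokenizers import Tokenizer

   class WhitespaceTokenizer(Tokenizer):
       """Toy tokenizer that splits on whitespace only."""

       def tokenize(self, sentence: str) -> List[str]:
           # Split the sentence on runs of whitespace.
           return sentence.split()

       def convert_tokens_to_string(self, tokens: List[str]) -> str:
           # Merge the tokens back into a sentence with single spaces.
           return ' '.join(tokens)

   tokenizer = WhitespaceTokenizer(mask_token='UNKWORDZ')
   tokenizer.tokenize('a simple example')  # ['a', 'simple', 'example']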

class dianna.utils.tokenizers.SpacyTokenizer(name: str = 'en_core_web_sm', mask_token: str = 'UNKWORDZ')[source]

Bases: Tokenizer

spaCy tokenizer for natural language.

MATCH_token_unk_token[source]

MATCH_token_unk_white[source]

MATCH_white_unk_token[source]

Regular-expression patterns for the whitespace fixes around the mask token (see _fix_whitespace()).

tokenize(sentence: str) → List[str][source]

Tokenize a sentence using spaCy.

convert_tokens_to_string(tokens: List[str]) → str[source]

Paste the tokens back together with spaces in between.
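
A round-trip usage sketch, assuming spacy is installed and the en_core_web_sm model has been downloaded (python -m spacy download en_core_web_sm):

   from dianna.utils.tokenizers import SpacyTokenizer

   # Defaults: name='en_core_web_sm', mask_token='UNKWORDZ'.
   tokenizer = SpacyTokenizer()

   tokens = tokenizer.tokenize('The film was not bad at all.')
   # spaCy gives punctuation its own token, e.g.
   # ['The', 'film', 'was', 'not', 'bad', 'at', 'all', '.']

   sentence = tokenizer.convert_tokens_to_string(tokens)
   # 'The film was not bad at all .' (the tokens are pasted together
   # with plain spaces, including the final period)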

_fix_whitespace(sentence: str)[source]

Apply fixes for the punctuation/special characters problem.

For more info, see: https://github.com/dianna-ai/dianna/issues/531
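
The names of the MATCH_* attributes above suggest the shape of these fixes: re-inserting whitespace where the mask token ends up glued to a neighbouring token or to surrounding whitespace. A sketch of the idea, using a hypothetical pattern rather than the actual patterns in dianna:

   import re

   mask = 'UNKWORDZ'
   # Hypothetical pattern in the spirit of MATCH_token_unk_token: a
   # non-space character directly before the mask token and another
   # non-space character directly after it.
   match_token_unk_token = re.compile(rf'(\S){mask}(\S)')

   masked = f'not bad{mask}at all{mask}.'
   fixed = match_token_unk_token.sub(rf'\1 {mask} \2', masked)
   # 'not bad UNKWORDZ at all UNKWORDZ .'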