dianna.utils.tokenizers

Module Contents

Classes

Tokenizer

Abstract base class for tokenizing.

SpacyTokenizer

spaCy tokenizer for natural language.

class dianna.utils.tokenizers.Tokenizer(mask_token: str)[source]

Bases: abc.ABC

Abstract base class for tokenizing.

Exposes the same interface as (part of) the Tokenizer class from the Hugging Face transformers library.

abstract tokenize(sentence: str) → List[str][source]

Split a sentence into a list of tokens.

abstract convert_tokens_to_string(tokens: List[str]) → str[source]

Merge a list of tokens back into a sentence.
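
For illustration, a minimal concrete subclass could look as follows. This is only a sketch against the interface documented above; WhitespaceTokenizer is a hypothetical name and not part of dianna.

   from typing import List

   from dianna.utils.tokenizers import Tokenizer

   class WhitespaceTokenizer(Tokenizer):
       """Toy tokenizer that splits on whitespace only."""

       def tokenize(self, sentence: str) -> List[str]:
           # Split the sentence on runs of whitespace.
           return sentence.split()

       def convert_tokens_to_string(self, tokens: List[str]) -> str:
           # Merge the tokens back into a sentence with single spaces.
           return ' '.join(tokens)

   tokenizer = WhitespaceTokenizer(mask_token='UNKWORDZ')
   tokenizer.tokenize('a simple example')  # ['a', 'simple', 'example']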

class dianna.utils.tokenizers.SpacyTokenizer(name: str = 'en_core_web_sm', mask_token: str = 'UNKWORDZ')[source]

Bases: Tokenizer

spaCy tokenizer for natural language.

MATCH_token_unk_token[source]

MATCH_token_unk_white[source]

MATCH_white_unk_token[source]

Regular-expression patterns for the whitespace fixes around the mask token (see _fix_whitespace()).

tokenize(sentence: str) → List[str][source]

Tokenize a sentence using spaCy.

convert_tokens_to_string(tokens: List[str]) → str[source]

Paste the tokens back together with spaces in between.
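
A round-trip usage sketch, assuming spacy is installed and the en_core_web_sm model has been downloaded (python -m spacy download en_core_web_sm):

   from dianna.utils.tokenizers import SpacyTokenizer

   # Defaults: name='en_core_web_sm', mask_token='UNKWORDZ'.
   tokenizer = SpacyTokenizer()

   tokens = tokenizer.tokenize('The film was not bad at all.')
   # spaCy gives punctuation its own token, e.g.
   # ['The', 'film', 'was', 'not', 'bad', 'at', 'all', '.']

   sentence = tokenizer.convert_tokens_to_string(tokens)
   # 'The film was not bad at all .' (the tokens are pasted together
   # with plain spaces, including the final period)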

_fix_whitespace(sentence: str)[source]

Apply fixes for the punctuation/special characters problem.

For more info, see: https://github.com/dianna-ai/dianna/issues/531
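
The names of the MATCH_* attributes above suggest the shape of these fixes: re-inserting whitespace where the mask token ends up glued to a neighbouring token or to surrounding whitespace. A sketch of the idea, using a hypothetical pattern rather than the actual patterns in dianna:

   import re

   mask = 'UNKWORDZ'
   # Hypothetical pattern in the spirit of MATCH_token_unk_token: a
   # non-space character directly before the mask token and another
   # non-space character directly after it.
   match_token_unk_token = re.compile(rf'(\S){mask}(\S)')

   masked = f'not bad{mask}at all{mask}.'
   fixed = match_token_unk_token.sub(rf'\1 {mask} \2', masked)
   # 'not bad UNKWORDZ at all UNKWORDZ .'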