dianna.utils.tokenizers

Classes

Tokenizer

Abstract base class for tokenizing.

SpacyTokenizer

spaCy tokenizer for natural language.

Functions

get_tokenizer(_spacy, name)

Get tokenizer using spaCy.

Module Contents

class dianna.utils.tokenizers.Tokenizer(mask_token: str)

Bases: abc.ABC

Abstract base class for tokenizing.

Exposes part of the same interface as the transformers Tokenizer class.

mask_token
abstract tokenize(sentence: str) -> List[str]

Split sentence into list of tokens.

abstract convert_tokens_to_string(tokens: List[str]) -> str

Merge list of tokens back to sentence.
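The abstract interface above can be illustrated with a small subclass. The following is a minimal sketch, not the dianna source: the `Tokenizer` class here is a stand-in that mirrors the documented signatures so the example runs on its own, and `WhitespaceTokenizer` and the `"[MASK]"` value are hypothetical names chosen for illustration.

```python
from abc import ABC, abstractmethod
from typing import List


class Tokenizer(ABC):
    """Stand-in mirroring the documented interface (an assumption, not the dianna source)."""

    def __init__(self, mask_token: str):
        self.mask_token = mask_token

    @abstractmethod
    def tokenize(self, sentence: str) -> List[str]:
        """Split sentence into list of tokens."""

    @abstractmethod
    def convert_tokens_to_string(self, tokens: List[str]) -> str:
        """Merge list of tokens back to sentence."""


class WhitespaceTokenizer(Tokenizer):
    """Hypothetical subclass: split on whitespace, join with single spaces."""

    def tokenize(self, sentence: str) -> List[str]:
        return sentence.split()

    def convert_tokens_to_string(self, tokens: List[str]) -> str:
        return " ".join(tokens)


tokenizer = WhitespaceTokenizer(mask_token="[MASK]")
tokens = tokenizer.tokenize("Such a bad movie")
restored = tokenizer.convert_tokens_to_string(tokens)
```

Because both methods are abstract, a subclass has to implement the pair before it can be instantiated. In this sketch the two methods round-trip any sentence whose words are separated by single spaces.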

class dianna.utils.tokenizers.SpacyTokenizer(name: str = 'en_core_web_sm', mask_token: str = 'UNKWORDZ')

Bases: Tokenizer

spaCy tokenizer for natural language.

MATCH_token_unk_token
MATCH_token_unk_white
MATCH_white_unk_token
spacy_tokenizer
tokenize(sentence: str) -> List[str]

Tokenize sentence.

convert_tokens_to_string(tokens: List[str]) -> str

Join the tokens into a single string with spaces in between.

_fix_whitespace(sentence: str)

Apply fixes for the punctuation/special characters problem.

For more info, see: https://github.com/dianna-ai/dianna/issues/531
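To show why joining tokens with spaces can need a follow-up whitespace fix, here is a rough, self-contained sketch. It does not use spaCy or reproduce the `_fix_whitespace` rules: `simple_tokenize` and `mask_tokens` are hypothetical helpers, and the regular expression is only an assumed stand-in for a real tokenizer. The mask token value is the default from the `SpacyTokenizer` signature above.

```python
import re
from typing import List

# Default mask_token of SpacyTokenizer, taken from the signature above.
MASK_TOKEN = "UNKWORDZ"


def simple_tokenize(sentence: str) -> List[str]:
    """Hypothetical stand-in tokenizer: words and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)


def mask_tokens(tokens: List[str], positions: List[int],
                mask_token: str = MASK_TOKEN) -> List[str]:
    """Return a copy of tokens with the given positions replaced by mask_token."""
    hidden = set(positions)
    return [mask_token if i in hidden else tok for i, tok in enumerate(tokens)]


tokens = simple_tokenize("Such a bad movie!")
masked = mask_tokens(tokens, positions=[2])

# Naive join with spaces in between: note the space before "!".
joined = " ".join(masked)
print(joined)  # Such a UNKWORDZ movie !
```

The naive join inserts a space before the exclamation mark that the original sentence did not have. This is one kind of punctuation mismatch that a whitespace fix could address. The exact cases handled by `_fix_whitespace` are described in the linked issue, not here.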

dianna.utils.tokenizers.get_tokenizer(_spacy, name)

Get tokenizer using spaCy.