dianna.utils.tokenizers
Module Contents
Classes
Tokenizer | Abstract base class for tokenizing.
SpacyTokenizer | Spacy tokenizer for natural language.
- class dianna.utils.tokenizers.Tokenizer(mask_token: str)[source]
Bases: abc.ABC
Abstract base class for tokenizing.
Exposes part of the same interface as the transformers Tokenizer class.
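As a rough illustration of how such an abstract base class might be subclassed, the sketch below defines a local stand-in rather than importing dianna. The method names `tokenize` and `convert_tokens_to_string` are assumptions modeled on the transformers tokenizer interface, and `WhitespaceTokenizer` is a hypothetical subclass invented for this example.

```python
from abc import ABC, abstractmethod
from typing import List


class Tokenizer(ABC):
    """Local stand-in for the abstract base class (not the dianna source)."""

    def __init__(self, mask_token: str):
        # Token that replaces a masked-out word.
        self.mask_token = mask_token

    @abstractmethod
    def tokenize(self, sentence: str) -> List[str]:
        """Split a sentence into tokens (assumed method name)."""

    @abstractmethod
    def convert_tokens_to_string(self, tokens: List[str]) -> str:
        """Join tokens back into a sentence (assumed method name)."""


class WhitespaceTokenizer(Tokenizer):
    """Hypothetical subclass that simply splits on whitespace."""

    def tokenize(self, sentence: str) -> List[str]:
        return sentence.split()

    def convert_tokens_to_string(self, tokens: List[str]) -> str:
        return " ".join(tokens)


tokenizer = WhitespaceTokenizer(mask_token="UNKWORDZ")
tokens = tokenizer.tokenize("such a bad movie")
tokens[2] = tokenizer.mask_token  # mask the third word
masked = tokenizer.convert_tokens_to_string(tokens)
print(masked)  # such a UNKWORDZ movie
```

Keeping the mask token on the tokenizer means code that masks words can substitute it without knowing how a particular tokenizer splits or rejoins text.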
- class dianna.utils.tokenizers.SpacyTokenizer(name: str = 'en_core_web_sm', mask_token: str = 'UNKWORDZ')[source]
Bases: Tokenizer
Spacy tokenizer for natural language.
- _fix_whitespace(sentence: str)[source]
Apply fixes for the punctuation/special characters problem.
For more info, see: https://github.com/dianna-ai/dianna/issues/531