dianna.utils.tokenizers
Classes
Tokenizer | Abstract base class for tokenizing.
SpacyTokenizer | Spacy tokenizer for natural language.
Functions
Get tokenizer using spacy.
Module Contents
- class dianna.utils.tokenizers.Tokenizer(mask_token: str)[source]
Bases: abc.ABC
Abstract base class for tokenizing.
Has the same interface as (part of) the transformers Tokenizer class.
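The following is a minimal sketch of what such an abstract base class could look like, not the dianna implementation. It assumes that the shared part of the interface consists of `tokenize` and `convert_tokens_to_string`, two methods that exist on transformers tokenizers. The class names `SketchTokenizer` and `WhitespaceTokenizer` are hypothetical and introduced only for illustration.

```python
from abc import ABC, abstractmethod
from typing import List


class SketchTokenizer(ABC):
    """Hypothetical stand-in for an abstract tokenizer base class."""

    def __init__(self, mask_token: str):
        # Token used to replace words that should be hidden from a model.
        self.mask_token = mask_token

    @abstractmethod
    def tokenize(self, sentence: str) -> List[str]:
        """Split a sentence into a list of tokens."""

    @abstractmethod
    def convert_tokens_to_string(self, tokens: List[str]) -> str:
        """Join a list of tokens back into a sentence."""


class WhitespaceTokenizer(SketchTokenizer):
    """Hypothetical concrete subclass that splits on whitespace only."""

    def tokenize(self, sentence: str) -> List[str]:
        return sentence.split()

    def convert_tokens_to_string(self, tokens: List[str]) -> str:
        return " ".join(tokens)


tokenizer = WhitespaceTokenizer(mask_token="UNKWORDZ")
tokens = tokenizer.tokenize("a very good movie")

# Mask one token and rebuild the sentence.
tokens[2] = tokenizer.mask_token
masked = tokenizer.convert_tokens_to_string(tokens)
print(masked)
```

Keeping the two methods abstract means any subclass can be swapped in wherever a tokenizer with this interface is expected, as long as joining the tokens reproduces a usable sentence.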
- class dianna.utils.tokenizers.SpacyTokenizer(name: str = 'en_core_web_sm', mask_token: str = 'UNKWORDZ')[source]
Bases: Tokenizer
Spacy tokenizer for natural language.
- _fix_whitespace(sentence: str)[source]
Apply fixes for the punctuation/special characters problem.
For more info, see: https://github.com/dianna-ai/dianna/issues/531