dianna.utils.tokenizers
=======================

.. py:module:: dianna.utils.tokenizers


Classes
-------

.. autoapisummary::

   dianna.utils.tokenizers.Tokenizer
   dianna.utils.tokenizers.SpacyTokenizer


Functions
---------

.. autoapisummary::

   dianna.utils.tokenizers.get_tokenizer


Module Contents
---------------

.. py:class:: Tokenizer(mask_token: str)

   Bases: :py:obj:`abc.ABC`


   Abstract base class for tokenizing.

   Has the same interface as (part of) the transformers Tokenizer class.


   .. py:attribute:: mask_token


   .. py:method:: tokenize(sentence: str) -> List[str]
      :abstractmethod:


      Split sentence into list of tokens.


   .. py:method:: convert_tokens_to_string(tokens: List[str]) -> str
      :abstractmethod:


      Merge list of tokens back to sentence.


.. py:class:: SpacyTokenizer(name: str = 'en_core_web_sm', mask_token: str = 'UNKWORDZ')

   Bases: :py:obj:`Tokenizer`


   Spacy tokenizer for natural language.


   .. py:attribute:: MATCH_token_unk_token


   .. py:attribute:: MATCH_token_unk_white


   .. py:attribute:: MATCH_white_unk_token


   .. py:attribute:: spacy_tokenizer


   .. py:method:: tokenize(sentence: str) -> List[str]

      Tokenize sentence.


   .. py:method:: convert_tokens_to_string(tokens: List[str]) -> str

      Paste together with spaces in between.


   .. py:method:: _fix_whitespace(sentence: str)

      Apply fixes for the punctuation/special characters problem.

      For more info, see:
      https://github.com/dianna-ai/dianna/issues/531


.. py:function:: get_tokenizer(_spacy, name)

   Get tokenizer using spacy.
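
The ``Tokenizer`` interface documented above can be illustrated with a minimal sketch. It makes two assumptions: the ``Tokenizer`` base class below is a local stand-in that mirrors the documented signatures rather than an import from dianna, and ``WhitespaceTokenizer`` is a hypothetical subclass that is not part of the library. It shows only the round trip from sentence to tokens and back, with one token replaced by the mask token.

```python
from abc import ABC, abstractmethod
from typing import List


class Tokenizer(ABC):
    """Local stand-in mirroring the documented interface (not imported from dianna)."""

    def __init__(self, mask_token: str):
        self.mask_token = mask_token

    @abstractmethod
    def tokenize(self, sentence: str) -> List[str]:
        """Split sentence into list of tokens."""

    @abstractmethod
    def convert_tokens_to_string(self, tokens: List[str]) -> str:
        """Merge list of tokens back to sentence."""


class WhitespaceTokenizer(Tokenizer):
    """Hypothetical subclass: splits on whitespace, joins with single spaces."""

    def tokenize(self, sentence: str) -> List[str]:
        return sentence.split()

    def convert_tokens_to_string(self, tokens: List[str]) -> str:
        return " ".join(tokens)


tokenizer = WhitespaceTokenizer(mask_token="UNKWORDZ")
tokens = tokenizer.tokenize("Such a bad movie")

# Replace one token with the mask token, then merge the tokens back.
tokens[2] = tokenizer.mask_token
masked = tokenizer.convert_tokens_to_string(tokens)
print(masked)
```

Any subclass that implements both abstract methods can be instantiated. Instantiating the base class directly raises ``TypeError``, because ``tokenize`` and ``convert_tokens_to_string`` are abstract.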