prepare_and_tokenize    Prepare Text and Split on Spaces
prepare_text            Prepare Text for Tokenization
remove_control_characters
                        Remove Non-Character Characters
remove_diacritics       Remove Diacritical Marks on Characters
remove_replacement_characters
                        Remove the Unicode Replacement Character
space_cjk               Add Spaces Around CJK Ideographs
space_punctuation       Add Spaces Around Punctuation
squish_whitespace       Remove Extra Whitespace
tokenize_space          Break Text at Spaces
validate_utf8           Clean Up Text to UTF-8
