Sonnet Tokenization Engine FAQ

What is Sonnet Tokenization Engine?

Many NLP systems operate on the individual tokens (typically words) of text rather than on the text as a whole. Sonnet Tokenization Engine breaks input text up into its individual tokens. For best results, the input text should be a single sentence. If you need to extract sentences, see Prose Sentence Extraction Engine.

What is “tokenization” and what are “tokens”?

Tokenization is the process of breaking text into tokens. For example, given the sentence “The dog is black.”, the result of tokenization is the list of the sentence's individual tokens: [“The”, “dog”, “is”, “black”].
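
To make the example concrete, here is a minimal Python sketch of the idea. It is not Sonnet's algorithm: tokenize is a hypothetical helper that uses a regular expression to collect runs of word characters, which happens to reproduce the output shown above for this one sentence. Languages written without spaces between words, such as Chinese, Japanese, and Thai, need more than a rule like this.

```python
import re


def tokenize(sentence: str) -> list[str]:
    """Return the word tokens of a sentence, dropping punctuation.

    Illustrative only; this is an assumption about what a simple
    tokenizer could look like, not how Sonnet works internally.
    """
    # \w+ matches a run of letters, digits, or underscores. The optional
    # group keeps an internal apostrophe, so "don't" stays one token.
    return re.findall(r"\w+(?:'\w+)*", sentence)


print(tokenize("The dog is black."))  # ['The', 'dog', 'is', 'black']
```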

What languages does Sonnet Tokenization Engine support?

Sonnet supports the languages listed below.

Language        ISO 639-3 Code
Arabic          ara
Belarusian      bel
Bulgarian       bul
Catalan         cat
Czech           ces
Danish          dan
German          deu
Modern Greek    ell
English         eng
Estonian        est
Finnish         fin
French          fra
Irish           gle
Hebrew          heb
Hindi           hin
Croatian        hrv
Hungarian       hun
Indonesian      ind
Icelandic       isl
Italian         ita
Japanese        jpn
Korean          kor
Latvian         lav
Lithuanian      lit
Macedonian      mkd
Maltese         mlt
Malay           msa
Dutch           nld
Norwegian       nor
Polish          pol
Portuguese      por
Romanian        ron
Russian         rus
Slovak          slk
Slovene         slv
Spanish         spa
Albanian        sqi
Serbian         srp
Swedish         swe
Thai            tha
Turkish         tur
Ukrainian       ukr
Vietnamese      vie
Chinese         zho

How do I use Sonnet Tokenization Engine?

Sonnet has a REST API. To tokenize text, submit it to Sonnet through this API. For an example, see the Quick Start.
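
As a rough illustration of submitting text over HTTP, the Python sketch below builds a POST request using only the standard library. Everything specific in it is an assumption: the URL in SONNET_URL (host, port, and path), the plain-text request body, and the helper names build_tokenize_request and send are all hypothetical. The Quick Start is the authoritative source for the real endpoint and request format.

```python
import urllib.request

# Assumption: placeholder address, NOT the documented Sonnet endpoint.
# Replace it with the URL given in the Quick Start.
SONNET_URL = "http://localhost:8080/tokenize"


def build_tokenize_request(text: str, url: str = SONNET_URL) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying the text to tokenize."""
    return urllib.request.Request(
        url,
        data=text.encode("utf-8"),  # request body: the raw sentence as UTF-8
        headers={"Content-Type": "text/plain; charset=utf-8"},
        method="POST",
    )


def send(request: urllib.request.Request) -> str:
    """Send the request and return the response body as text."""
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8")
```

With a running Sonnet instance and the correct URL, calling send(build_tokenize_request("The dog is black.")) would return whatever the service responds with; the response format is not assumed here.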

How much does Sonnet Tokenization Engine cost?

Sonnet is free when used to tokenize English text. A license is required to tokenize other languages.

How do I get Sonnet Tokenization Engine?

Sonnet is available as a direct download, on the AWS Marketplace, and on Docker Hub. Get Sonnet Tokenization Engine.