Rasa NLU supports a number of different languages. Exactly which ones depends on the backend you are using, and the features you require.
The tensorflow_embedding pipeline supports any language in principle,
but it only performs intent classification.
In addition, with the spaCy backend you can now load fastText vectors, which are available
for hundreds of languages.
For both intent and entity recognition, the following languages and backend combinations are tested and available:
These languages can be set as part of the Configuration.
Adding a new language
We want to make adding new languages as simple as possible so that the number of supported languages can grow. Nevertheless, to use a language you need either a trained word representation or a large corpus of text in that language from which you can train that representation yourself.
These are the steps necessary to add a new language:
spaCy already provides a good documentation page on Adding languages. It will help you train a tokenizer and vocabulary for a new language in spaCy.
As described in the documentation, you need to register your language using
set_lang_class() which will
allow Rasa NLU to load and use your new language by passing in your language identifier as the
language Configuration option.
To add a new language for the MITIE backend:
- Get a ~clean language corpus (a Wikipedia dump works) as a set of text files
- Build and run the MITIE Wordrep Tool on your corpus. This can take several hours or even days, depending on your dataset and your workstation. You’ll need something like 128GB of RAM for wordrep to run. Yes, that’s a lot: if you have less, try extending your swap.
- Set the path of your new
  total_word_feature_extractor.dat as the value of the mitie_file parameter in your Rasa NLU configuration.
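The last step can be sketched in Python. This is a minimal sketch, not part of the Rasa NLU API: the helper name `write_mitie_config`, the flat JSON layout, and any file name you pass in are assumptions for illustration. Only the `language` and `mitie_file` option names come from the text above. Check the Configuration page for the format and any additional options your Rasa NLU version expects.

```python
import json
from pathlib import Path


def write_mitie_config(config_path: str, extractor_path: str, language: str) -> dict:
    """Write a small config file that points Rasa NLU at a MITIE feature extractor.

    Hypothetical helper: the flat JSON layout is an assumption, not a
    documented schema. Only the two option names are taken from the docs.
    """
    config = {
        "language": language,          # your language identifier
        "mitie_file": extractor_path,  # path to total_word_feature_extractor.dat
    }
    Path(config_path).write_text(json.dumps(config, indent=2), encoding="utf-8")
    return config
```

For example, `write_mitie_config("my_config.json", "data/total_word_feature_extractor.dat", "xx")` would write both options to a hypothetical `my_config.json`, where `"xx"` stands in for your language identifier.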
Some notes about using the Jieba tokenizer together with MITIE on Chinese
language data: to use it, you need a suitable MITIE feature extractor, e.g.
data/total_word_feature_extractor_zh.dat. It should be trained
on a Chinese corpus using the MITIE Wordrep Tool
(training takes 2-3 days).
For training, please build the MITIE Wordrep Tool. Note that a Chinese corpus must be tokenized before it is fed into the tool for training. A closed-domain corpus that closely matches your use case works best.
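The pre-tokenization step could look roughly like the following. This is a minimal sketch under stated assumptions: the helper name `pretokenize_corpus` and the directory layout (one folder of UTF-8 `.txt` files) are hypothetical, and it assumes that separating words with spaces is enough for the Wordrep Tool to see them as distinct tokens. The tokenizer is passed in as a function so that the sketch itself has no third-party dependencies.

```python
from pathlib import Path
from typing import Callable, Iterable


def pretokenize_corpus(
    src_dir: str,
    dst_dir: str,
    tokenize: Callable[[str], Iterable[str]],
) -> int:
    """Write a space-separated copy of every .txt file in src_dir to dst_dir.

    Returns the number of files written. Hypothetical helper for illustration.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    written = 0
    for path in sorted(src.glob("*.txt")):
        out_path = dst / path.name
        with path.open(encoding="utf-8") as fin, out_path.open("w", encoding="utf-8") as fout:
            for line in fin:
                # Drop whitespace-only tokens so words end up separated by single spaces.
                tokens = [t for t in tokenize(line.strip()) if t.strip()]
                fout.write(" ".join(tokens) + "\n")
        written += 1
    return written
```

With Jieba installed, you would pass `jieba.lcut` as the `tokenize` argument, e.g. `pretokenize_corpus("corpus_raw", "corpus_tokenized", jieba.lcut)`, and then point the Wordrep Tool at the output folder. Both folder names here are placeholders.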