Entity Extraction

Component          Requires           Model                      Notes
ner_crf            sklearn-crfsuite   conditional random field   good for training custom entities
ner_spacy          spaCy              averaged perceptron        provides pre-trained entities
ner_duckling_http  running duckling   context-free grammar       provides pre-trained entities
ner_mitie          MITIE              structured SVM             good for training custom entities

Custom Entities

Almost every chatbot and voice app will have some custom entities. In a restaurant bot, chinese is a cuisine, but in a language-learning app it would mean something very different. The ner_crf component can learn custom entities in any language.

Extracting Places, Dates, People, Organisations

spaCy has excellent pre-trained named-entity recognisers for a few different languages. You can test them out in this interactive demo. We don’t recommend that you try to train your own NER using spaCy unless you have a lot of data and know what you are doing. Note that some spaCy models are highly case-sensitive.

Dates, Amounts of Money, Durations, Distances, Ordinals

The duckling library does a great job of turning expressions like “next Thursday at 8pm” into actual datetime objects that you can use, e.g.

"next Thursday at 8pm"
=> {"value":"2018-05-31T20:00:00.000+01:00"}

The list of supported languages is here. Duckling can also handle durations like “two hours”, amounts of money, distances, and ordinals. Fortunately, there is a ready-to-use duckling docker container that you just need to spin up and connect to Rasa NLU (see ner_duckling_http).
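
The value in the example above is an ISO 8601 timestamp string, so it still has to be converted before you can do date arithmetic with it. A minimal sketch, assuming Python 3.7 or later and a value shaped like the one shown (the variable names are illustrative only):

```python
from datetime import datetime, timedelta

# Value string copied from the duckling example above.
duckling_value = "2018-05-31T20:00:00.000+01:00"

# fromisoformat keeps the UTC offset, so the result is timezone-aware.
parsed = datetime.fromisoformat(duckling_value)

# Once parsed, ordinary datetime arithmetic works.
one_hour_before = parsed - timedelta(hours=1)

print(parsed.isoformat())
print(one_hour_before.isoformat())
```

Keeping the offset matters if your application stores or compares times across time zones.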

Regular Expressions (regex)

You can use regular expressions to help the CRF model learn to recognize entities. In the Training Data Format you can provide a list of regular expressions, each of which provides the ner_crf with an extra binary feature, which says if the regex was found (1) or not (0).

For example, the names of German streets often end in strasse. By adding this as a regex, we are telling the model to pay attention to words ending this way, and it will quickly learn to associate that with a location entity.
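
To make the idea of a binary regex feature concrete, here is a small sketch of how a pattern can be turned into one 0/1 value per token. It only illustrates the concept and is not how ner_crf is implemented internally; the pattern, the function name, and the whitespace tokenisation are all assumptions made for the example.

```python
import re

# Hypothetical pattern: words ending in "strasse", ignoring case.
STREET_PATTERN = re.compile(r"\w+strasse$", re.IGNORECASE)


def regex_features(tokens, pattern=STREET_PATTERN):
    """Return 1 for each token the pattern matches, otherwise 0."""
    return [1 if pattern.search(token) else 0 for token in tokens]


# Naive whitespace tokenisation, good enough for the illustration.
tokens = "ich wohne in der Friedrichstrasse".split()
features = regex_features(tokens)

for token, flag in zip(tokens, features):
    print(token, flag)
```

The CRF then sees this flag alongside its other token features and decides during training how much weight to give it.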

If you just want to match regular expressions exactly, you can do this in your code, as a postprocessing step after receiving the response from Rasa NLU.
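
Such a postprocessing step could look roughly like the sketch below. It assumes the response is a dict with a "text" field and an "entities" list, as in the examples on this page; the zip-code pattern, the entity name, and the "regex_postprocessor" extractor label are made up for the illustration.

```python
import re

# Hypothetical pattern: a five-digit zip code.
ZIP_PATTERN = re.compile(r"\b\d{5}\b")


def add_regex_entities(response, pattern, entity_name):
    """Return a copy of the response with one extra entity per regex match."""
    entities = list(response.get("entities", []))
    for match in pattern.finditer(response["text"]):
        entities.append({
            "start": match.start(),
            "end": match.end(),
            "value": match.group(),
            "entity": entity_name,
            # Illustrative label, not a real Rasa NLU component name.
            "extractor": "regex_postprocessor",
        })
    return {**response, "entities": entities}


response = {"text": "send it to 10115 Berlin", "entities": []}
enriched = add_regex_entities(response, ZIP_PATTERN, "zipcode")
print(enriched["entities"])
```

Returning a copy rather than modifying the response in place keeps the original parse result available for logging or debugging.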

Returned Entities Object

In the object returned after parsing, there are two fields that show how the pipeline affected the returned entities. The extractor field of an entity tells you which entity extractor found this particular entity. The processors field contains the names of the components that altered this specific entity.

The use of synonyms can also cause the value field not to match the text exactly. Instead, it will return the trained synonym.

{
  "text": "show me chinese restaurants",
  "intent": "restaurant_search",
  "entities": [
    {
      "start": 8,
      "end": 15,
      "value": "chinese",
      "entity": "cuisine",
      "extractor": "ner_crf",
      "confidence": 0.854,
      "processors": []
    }
  ]
}
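
When debugging, it can help to compare each entity's value with the text it was extracted from, using the start and end offsets. The sketch below does this for the response shown above; the helper names are illustrative and not part of Rasa NLU.

```python
response = {
    "text": "show me chinese restaurants",
    "intent": "restaurant_search",
    "entities": [
        {
            "start": 8,
            "end": 15,
            "value": "chinese",
            "entity": "cuisine",
            "extractor": "ner_crf",
            "confidence": 0.854,
            "processors": [],
        }
    ],
}


def surface_text(response, entity):
    """Slice the original message using the entity's character offsets."""
    return response["text"][entity["start"]:entity["end"]]


def describe_entities(response):
    """Summarise each entity and flag values that differ from the text."""
    rows = []
    for entity in response["entities"]:
        span = surface_text(response, entity)
        rows.append({
            "entity": entity["entity"],
            "value": entity["value"],
            "surface_text": span,
            "extractor": entity.get("extractor"),
            "processors": entity.get("processors", []),
            "value_was_rewritten": entity["value"] != span,
        })
    return rows


summary = describe_entities(response)
print(summary)
```

A value that differs from its surface text usually means a synonym mapping or a normalising extractor such as duckling changed it.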

Some extractors, like duckling, may include additional information. For example:

{
  "additional_info":{
    "grain":"day",
    "type":"value",
    "value":"2018-06-21T00:00:00.000-07:00",
    "values":[
      {
        "grain":"day",
        "type":"value",
        "value":"2018-06-21T00:00:00.000-07:00"
      }
    ]
  },
  "confidence":1.0,
  "end":5,
  "entity":"time",
  "extractor":"ner_duckling_http",
  "start":0,
  "text":"today",
  "value":"2018-06-21T00:00:00.000-07:00"
}

Note

The confidence will be set by the CRF entity extractor (ner_crf component). The duckling entity extractor will always return 1. The ner_spacy extractor does not provide this information and returns null.
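
Because the confidence can be a number or null depending on the extractor, code that filters entities by confidence has to handle the missing case explicitly. A minimal sketch, assuming the parsed JSON has been loaded into Python so that null becomes None; the threshold of 0.5 and the choice to keep entities without a confidence are arbitrary and only for illustration.

```python
def confident_entities(entities, threshold=0.5, keep_unknown=True):
    """Keep entities whose confidence meets the threshold.

    Entities with no confidence (None or missing) are kept or dropped
    according to keep_unknown.
    """
    kept = []
    for entity in entities:
        confidence = entity.get("confidence")
        if confidence is None:
            if keep_unknown:
                kept.append(entity)
        elif confidence >= threshold:
            kept.append(entity)
    return kept


entities = [
    {"entity": "cuisine", "extractor": "ner_crf", "confidence": 0.854},
    {"entity": "location", "extractor": "ner_crf", "confidence": 0.31},
    {"entity": "time", "extractor": "ner_duckling_http", "confidence": 1.0},
    {"entity": "PERSON", "extractor": "ner_spacy", "confidence": None},
]

kept = confident_entities(entities)
print([e["entity"] for e in kept])
```

Whether unknown confidences should pass the filter depends on how much you trust the extractor that produced them.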
