EnrichFromLookup

A common requirement when ingesting data is to read the value of a document field and add further information to the document based on that value.

This processor provides a way to define lookups that depend on static data contained in a comma- or tab-separated file. The processor parses the data file and adds the data to a database so that queries can be performed quickly. The processor can then use the database to enrich documents.

For example, the default configuration includes a static data file containing postcode data. The processor reads a postcode from the document and adds fields containing a place name, latitude, and longitude. You can define your own data lookup by editing JSON configuration files (see Example Configuration).
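The general idea can be illustrated outside NiFi. The following is a minimal sketch, not the processor's implementation: it assumes an in-memory SQLite database, the sample rows and coordinates are invented, and the function names build_database and enrich are hypothetical.

```python
import csv
import io
import sqlite3

# Invented sample rows in tab-separated form: postcode, place, latitude, longitude.
SAMPLE_TSV = "CB4 0WZ\tCambridge\t52.23\t0.14\nAB10 1AB\tAberdeen\t57.14\t-2.11\n"


def build_database(tsv_text):
    """Load tab-separated rows into an indexed in-memory table."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE postcodes (postcode TEXT, place TEXT, latitude TEXT, longitude TEXT)"
    )
    rows = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    db.executemany("INSERT INTO postcodes VALUES (?, ?, ?, ?)", rows)
    # The index is what keeps lookups by postcode fast on large data files.
    db.execute("CREATE INDEX idx_postcode ON postcodes (postcode)")
    return db


def enrich(document, db):
    """Return a copy of the document with place, latitude, and longitude added."""
    row = db.execute(
        "SELECT place, latitude, longitude FROM postcodes WHERE postcode = ?",
        (document["postcode"],),
    ).fetchone()
    enriched = dict(document)
    if row is not None:
        enriched.update(zip(("place", "latitude", "longitude"), row))
    return enriched


db = build_database(SAMPLE_TSV)
print(enrich({"postcode": "AB10 1AB"}, db))
```

A document whose postcode is not in the table is returned unchanged in this sketch; how the processor itself treats a failed lookup is not covered here.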

Properties

Name Default Value Description
IDOL License Service   An IdolLicenseServiceImpl that provides a way to communicate with a Knowledge Discovery License Server.

Lookup configuration directories   A list of directories that contain JSON configuration files that describe the lookup(s) to perform.
Lookup Identifiers   The names of the lookups to perform.

Relationships

Name Description
success Processing was successful.
failure Processing failed.

Example Configuration

The processor is supplied with a configuration file named data.json. This file describes the source data and specifies how to load the data into a database.

At the beginning of the file, the file_map object defines the static data files that contain the data to use:

"file_map": {
              "country_codes_csv_file": "country_codes.txt.gz",
              "postcode_GB_full_csv_file": "postcode_GB_full.txt.gz",
              "postcode_all_countries_csv_file": "postcode_all_countries.txt.gz"
        },

The csv_file object describes how to load the data from a comma- or tab-separated file into the database:

"csv_file": [
                {
                    "id": "postcode_GB_full",
                    "file_key": "postcode_GB_full_csv_file",
                    "use_header_row": false,
                    "separator": "\\t",
                    "headers": ["country", "postcode", "place", "name1", "code1", "name2", "code2", "name3", "code3", "latitude", "longitude", "accuracy"],
                    "names": [
                                {
                                    "name": "postcode",
                                    "expr": "${postcode:replace(' ', '')}"
                                },
                                "place", "name1", "name2", "name3", "latitude", "longitude"
                    ],
                    "indexes": [["postcode"]]
                },

In this structure:

  • id provides a friendly name for the data. A lookup refers to the data by this name, through its data_id property (see lookup.json).
  • file_key matches one of the files defined in the file_map section.
  • use_header_row specifies whether the first line of the data file contains header/column names.
  • separator specifies the separator used in the data file. The example data is tab separated but the processor can also parse comma-separated files.
  • headers specifies header/column names (when use_header_row is false).
  • names specifies a list of columns to add to the database. In the example, this is a subset of the available data.

    • In this array you can specify either strings or objects with name and expr properties.

    • The name property specifies the name of a column to add to the database.

    • The expr property specifies the value to add in that column. This property accepts NiFi expression language, so that you can transform the data if necessary. The example removes any spaces from the postcodes before adding the postcodes to the database.

    • The string place is equivalent to the object:

      {
        "name": "place",
        "expr": "${place}"
      }
  • indexes specifies the names of the database columns to use for creating an index, to provide fast lookups. This example assumes that location data is to be found from a postcode, and optimizes the database for queries on that column.
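The shorthand for entries in the names array, and the way headers give names to the values in a data line, can be sketched as follows. This is an illustration only: the helper names expand_names and row_to_record are hypothetical, the data line is invented, and the NiFi expression is imitated with an ordinary Python string method.

```python
def expand_names(names):
    """Expand string entries into the equivalent name/expr objects."""
    expanded = []
    for entry in names:
        if isinstance(entry, str):
            # A bare string is shorthand for a column that is copied unchanged.
            expanded.append({"name": entry, "expr": "${" + entry + "}"})
        else:
            expanded.append({"name": entry["name"], "expr": entry["expr"]})
    return expanded


def row_to_record(headers, line, separator="\t"):
    """Pair each value in a data line with its header name."""
    return dict(zip(headers, line.split(separator)))


names = [
    {"name": "postcode", "expr": "${postcode:replace(' ', '')}"},
    "place",
    "latitude",
]
for column in expand_names(names):
    print(column["name"], "<-", column["expr"])

headers = ["country", "postcode", "place"]
record = row_to_record(headers, "GB\tAB10 1AB\tAberdeen")
# Python stand-in for the expression ${postcode:replace(' ', '')}.
print(record["postcode"].replace(" ", ""))
```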

Another JSON configuration file, lookup.json, defines how to perform a lookup and how to modify a Knowledge Discovery document.

The metadata object identifies the document fields to read/write:

 "metadata": [
                {
                    "id": 1,
                    "path": "${'property:location_metadata_xpath':isNull():ifElse('.',${'property:location_metadata_xpath'})}",
                    "fields": [
                            {
                                "name": "countrycode",
                                "path": "countrycode"
                            },
                            {
                                "name": "country",
                                "path": "country"
                            },
                            {
                                "name": "postcode",
                                "path": "postcode"
                            },
                            {
                                "name": "place",
                                "path": "place"
                            },
                            {
                                "name": "latitude",
                                "path": "latitude"
                            },
                            {
                                "name": "longitude",
                                "path": "longitude"
                            }
                        ]
                }

  • In this example, the first path property specifies that the fields to read and write are at the root of the document metadata ('.'), unless the dynamic property location_metadata_xpath is set in the processor configuration, in which case the value of that property is used as the path.
  • In the fields array, the name properties are friendly names that can be used in the lookup section of the configuration file, described below. The path properties specify the path of a field to read or write (as an XPath expression), relative to the parent path.
  • fields objects can contain nested fields. The following example reads or writes two subfields of "field1" (named "subfield1" and "subfield2"), but only where "field1" also has a subfield named "subfield3" with an attribute named "attr":

    "metadata": [
                    {
                        "id": 1,
                        "path": "myTopLevelField",
                        "fields": [
                            {
                                "id": 2,
                                "path": "field1/subfield3[@attr]/..",
                                "fields": [
                                    {
                                        "name": "subfield1",
                                        "path": "subfield1"
                                    },
                                    {
                                        "name": "subfield2",
                                        "path": "subfield2"
                                    }
                                ]
                            }
                        ]
                    }
                ]
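The path "field1/subfield3[@attr]/.." combines a predicate with a step back to the parent element. The following sketch shows which elements such a path selects. It uses Python's xml.etree.ElementTree, which supports only a limited subset of XPath and is not necessarily the engine that the processor uses, and the sample metadata is invented.

```python
import xml.etree.ElementTree as ET

# Invented metadata: two field1 elements, only the first has a subfield3 with "attr".
SAMPLE = """
<myTopLevelField>
  <field1>
    <subfield1>a</subfield1>
    <subfield2>b</subfield2>
    <subfield3 attr="x"/>
  </field1>
  <field1>
    <subfield1>c</subfield1>
    <subfield3/>
  </field1>
</myTopLevelField>
"""

top = ET.fromstring(SAMPLE)  # plays the role of the outer "path"

# Select subfield3 elements that carry "attr", then step back up to their field1 parent.
parents = top.findall("field1/subfield3[@attr]/..")

for parent in parents:
    # The nested "fields" paths are resolved relative to each matching parent.
    print(parent.findtext("subfield1"), parent.findtext("subfield2"))
```

Only the first field1 element is selected, because the second has a subfield3 without the attribute.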
                        

The lookup section defines how to enrich a document:

 "lookup": [
              {
                  "id": "gb_postcode_to_place",
                  "metadata_id": 1,
                  "data_id": "postcode_GB_full",
                  "inputs": [
                          {
                                "name": "postcode",
                                "expr": "${postcode:replace(' ', '')}"
                          }
                  ],
                  "outputs": [
                          {
                                "name": "place",
                                "expr": "${literal('${place}, ${name3}, ${name2}, ${name1}'):replaceAll('^(, )+|(, )+(?=, |$)', '')}"
                          },
                          "latitude", "longitude"
                  ]
              },

In this structure:

  • id defines a friendly name for the lookup. This is what you specify in the Lookup Identifiers property of the processor, in the NiFi user interface.
  • metadata_id matches an object in the metadata section of the file, describing the fields to read/write.
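  • data_id matches the id of the data to use for the lookup, as defined in the csv_file section of data.json. In the example, postcode_GB_full refers to the postcode data loaded from postcode_GB_full_csv_file.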
  • inputs describes the field(s) to read from the document.

    • In this array you can specify either strings or objects with name and expr properties.
    • The expr property describes where to read the data from. In the example above, the data is read from the field with the friendly name postcode. The processor locates this field using the metadata section of the configuration file, by using the path that matches the name postcode. In this example, the value of the field is also transformed so that any spaces are removed (the expr property supports NiFi expression language).
    • The name property specifies the name of the column (in the database) to use for the lookup.
    • The string myfield is equivalent to the object:

      {
        "name": "myfield",
        "expr": "${myfield}"
      }
  • outputs specifies the data to add to the document.

    • In this array you can specify either strings or objects with name and expr properties.

    • The expr property defines the value for the document field. Any names used in the expr property must match database columns defined in the configuration (see the csv_file section in data.json). This property accepts NiFi expression language so that you can transform values if necessary.
    • The name property identifies the metadata field to add to the document (or overwrite, if it already exists). The name must match a name defined in the metadata section, in the object whose id matches the metadata_id that you specified.
    • The string myfield is equivalent to the object:

      {
        "name": "myfield",
        "expr": "${myfield}"
      }
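The two expressions in the example lookup can be imitated in Python to show their effect. This is a sketch under stated assumptions: it assumes that the ${...} references in the output expression are substituted with column values before replaceAll is applied, the function names normalize_postcode and format_place are hypothetical, and the sample row is invented.

```python
import re


def normalize_postcode(value):
    """Python stand-in for ${postcode:replace(' ', '')}."""
    return value.replace(" ", "")


def format_place(row):
    """Python stand-in for the "place" output expression."""
    columns = ("place", "name3", "name2", "name1")
    joined = ", ".join(row.get(column, "") for column in columns)
    # Same pattern as the replaceAll call: remove leading, trailing,
    # and doubled ", " separators that empty columns leave behind.
    return re.sub(r"^(, )+|(, )+(?=, |$)", "", joined)


# Invented row in which the name3 column is empty.
row = {"place": "Aberdeen", "name3": "", "name2": "Aberdeen City", "name1": "Scotland"}

print(normalize_postcode("AB10 1AB"))
print(format_place(row))
```

Without the replaceAll step, the empty name3 column would leave a doubled separator in the place value.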