Use Named Entity Recognition
Named Entity Recognition is a useful part of pre-index processing, which allows you to create fields with extracted values automatically. If your documents have consistent, well-defined formats, you can set up a Named Entity Recognition task to extract different parts of each document into different fields. For example, if the documents are letters, you can extract the name, address, and date from each document into predefined fields.
You can also use Named Entity Recognition to extract information from your data that you want to use as metadata.
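The letter example above can be illustrated with a minimal Python sketch. It is not the Named Entity Recognition grammar syntax or API: it uses plain regular expressions as a stand-in, and the field names (NAME, ADDRESS, DATE), the patterns, and the letter layout are all assumptions made for illustration.

```python
import re

MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)

# Hypothetical patterns for letters with a fixed layout. A real grammar
# would be far more expressive than these regular expressions.
PATTERNS = {
    "NAME": re.compile(
        r"^Dear\s+(?P<value>[A-Z][\w.]*(?:\s+[A-Z][\w.]*)*),", re.MULTILINE
    ),
    "ADDRESS": re.compile(r"^Address:\s*(?P<value>.+)$", re.MULTILINE),
    "DATE": re.compile(r"\b(?P<value>\d{1,2}\s+(?:" + MONTHS + r")\s+\d{4})\b"),
}


def extract_fields(text):
    """Return a mapping of field name to the first matching value in text."""
    fields = {}
    for field_name, pattern in PATTERNS.items():
        match = pattern.search(text)
        if match:
            fields[field_name] = match.group("value").strip()
    return fields


letter = """12 March 2021
Address: 10 High Street, Cambridge

Dear Jane Smith,
Thank you for your enquiry.
"""

print(extract_fields(letter))
```

Because every letter is assumed to follow the same layout, one set of patterns applies to the whole collection, which is why consistent document formats suit this approach.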
After you set up Named Entity Recognition, it extracts information automatically and in a consistent way. The syntax that you use to define entities is very expressive, so you can describe a wide range of entity patterns.
One use of Named Entity Recognition is to extract information from the results of Optical Character Recognition (OCR). You can automatically extract metadata from the result data. Another major use of Named Entity Recognition is Sentiment Analysis.
Another use of Named Entity Recognition is to extract information that might be personally identifiable information (PII), so that you can work out what user data you store and meet data protection regulations such as GDPR. OpenText provides a special package of PII grammars to help you find and extract PII from your data. For more information, refer to the Named Entity Recognition Grammars User Guide.
Choose the Entities to Extract
When you decide how to use Named Entity Recognition, consider your data, and how you want to be able to retrieve information. You should include all the entities that you are likely to want to search for, while minimizing the total number of values that you extract.
The Named Entity Recognition process uses additional indexing resources, and the extra document fields add to the index size and the time taken to index data. Extracting only the minimum set of useful entities gives you the most efficient index, which in turn makes retrieval more efficient.
Use Entities
When you have extracted entities, you can add them to document fields. This process is known as Document Tagging.
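As a rough sketch of the idea behind Document Tagging, the following Python code adds extracted entity values to the fields of a document. The document is modeled as a plain dictionary, and the function name tag_document, the entity types, and the field names are hypothetical. This does not represent how the product stores or tags documents.

```python
def tag_document(document, matches, field_for_entity):
    """Return a copy of document with entity values added to its fields.

    document:         dict mapping field name to a list of values
    matches:          iterable of (entity_type, value) pairs
    field_for_entity: dict mapping entity type to target field name;
                      entity types that are not listed are skipped
    """
    tagged = {name: list(values) for name, values in document.items()}
    for entity_type, value in matches:
        field_name = field_for_entity.get(entity_type)
        if field_name is None:
            continue  # this entity type was not chosen for tagging
        values = tagged.setdefault(field_name, [])
        if value not in values:  # avoid storing duplicate values
            values.append(value)
    return tagged


# Hypothetical input: one document and the entities found in it.
document = {"TITLE": ["Letter 42"]}
matches = [
    ("person", "Jane Smith"),
    ("date", "12 March 2021"),
    ("person", "Jane Smith"),
    ("phone", "01223 000000"),
]
field_for_entity = {"person": "PERSON_NAME", "date": "LETTER_DATE"}

print(tag_document(document, matches, field_for_entity))
```

Mapping each entity type to its own field keeps the values separate, and leaving an entity type out of the mapping means that it adds nothing to the document.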
The Content component has several ways of making retrieval more efficient for different types of data. You can use optimized field properties to make it quicker to filter and retrieve the values of the different entities.
How you use the entity tag fields depends partly on the entity. If you add each entity that you extract to a different field, you can treat each one in a different way, to maximize the value that you get, and minimize the cost in terms of index size and processing time.