Retrieve Content
Your organization is likely to have data in many different formats, distributed across many different repositories. Knowledge Discovery can help you make the most of your data, but the first step is to retrieve and enrich the data. This is called ingestion. Ingestion includes all of the processing that takes place before documents are indexed.
The ingestion process might include the following steps:
- Connect to a data repository and extract files and metadata.
- Extract the contents of container files such as zip files.
- Detect the format of each file and route it through appropriate processing tasks. Most files are filtered by File Content Extraction, which extracts any text contained within the file. However, if a connector extracts an image file, you might want to run Optical Character Recognition (OCR) to extract text from the image. If a connector extracts an audio or video file, you might want to determine whether the file contains speech and, if so, transcribe the speech into text.
- Tag or categorize documents based on the information that they contain.
- Discard documents that do not contain useful information.
- Standardize metadata field names, so that documents retrieved from different data repositories store common properties, such as a last modified date or author name, in the same metadata fields.
- Index the resulting information into your Content component index.
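The routing, discarding, and metadata standardization steps above can be illustrated with a minimal Python sketch. This is not Knowledge Discovery code and does not use any Knowledge Discovery API: the `Document` class, the task names, the `FIELD_ALIASES` mapping, and the `ingest` function are all hypothetical names chosen for illustration. The sketch also assumes that text has already been extracted from each file, so it only records which processing task would be selected.

```python
from dataclasses import dataclass, field

# Hypothetical mapping from repository-specific field names to common names.
FIELD_ALIASES = {
    "modified": "last_modified",
    "LastModifiedDate": "last_modified",
    "creator": "author",
    "Author": "author",
}


@dataclass
class Document:
    """Illustrative stand-in for a document moving through ingestion."""
    reference: str
    file_type: str  # assumed values: "text", "image", "audio", "video"
    text: str = ""
    metadata: dict = field(default_factory=dict)
    tasks: list = field(default_factory=list)


def route(doc):
    """Record the processing task that suits the detected file format."""
    if doc.file_type == "image":
        doc.tasks.append("ocr")
    elif doc.file_type in ("audio", "video"):
        doc.tasks.append("speech_to_text")
    else:
        doc.tasks.append("text_extraction")
    return doc


def standardize_metadata(doc):
    """Rename known field aliases so that all documents share field names."""
    doc.metadata = {FIELD_ALIASES.get(k, k): v for k, v in doc.metadata.items()}
    return doc


def ingest(docs):
    """Route each document, discard empty ones, and standardize metadata."""
    ready_to_index = []
    for doc in docs:
        route(doc)
        if not doc.text.strip():
            continue  # discard documents that contain no useful text
        ready_to_index.append(standardize_metadata(doc))
    return ready_to_index
```

In a real deployment these steps are configured in NiFi Ingest or in CFS rather than written by hand. The sketch only shows how the steps relate to one another: the format decides the task, empty documents are dropped, and field names are unified before indexing.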
The Knowledge Discovery product suite provides multiple ways to ingest information:
- NiFi Ingest. You can create an ingestion pipeline using NiFi Ingest, a set of Knowledge Discovery components for data retrieval and enrichment that run within an open-source platform called Apache NiFi. With NiFi Ingest, you can run your Connectors and data processing tasks inside Apache NiFi, and index documents into your Content index. NiFi Ingest is the recommended way to ingest data.
- Connectors and Connector Framework Server. You can run Connectors as standalone servers, which retrieve information from data repositories and send it to a Connector Framework Server (CFS). CFS processes documents and then indexes them into your Content index.
The following topics provide more information about ingestion:
- Connectors. Extract information from different repositories.
- Connector Framework Server. Process documents and index them into the Content component.
- Deploy Connectors and CFS. Install, configure, and test your connectors and CFS.
- OmniGroupServer. Configure OmniGroupServer for document security.