The extract option extracts entities from a document and writes the output either to a file or to the console. You can use this option to test your grammars.
The following table describes the available parameters for this command.
-l <licensefile>
|
The file containing a valid license key for Eduction. If you do not specify a license key, edktool attempts to load the license |
-i <inputfile>
|
The file to perform entity extraction on. The input file can be either an IDOL IDX file, an IDOL XML file, or a plain text file. It must be UTF-8 encoded. NOTE: If the input file is an XML file, the configuration file must contain entries for the |
-c <configfile>
|
A configuration file controlling the extraction. See Eduction Configuration File. You can specify one or more grammar files and one or more entities in place of a configuration file. Specifying a configuration file overrides the grammar or entity parameters. |
-g <grammarfile>
|
A grammar file to use. If you have set a configuration file with If you provide a grammar file but do not specify any entities with You can use wildcard expressions in this parameter. See Wildcard Expressions in edktool. |
-e <entity>
|
The entities to extract. Separate multiple entities with a comma. If you have set a configuration file with You can use wildcard expressions in this parameter. See Wildcard Expressions in edktool. |
-o <outputfile>
|
The file to write the results of the extraction to. The content of the optional output file depends on the type of input file provided and whether you use the If the input file type is an IDOL file and you do not use If the input file is a plain text file or an IDOL file with the If the input file is an IDOL file, the output file also contains document information. |
-m
|
Produce match results for IDOL input files. |
-q
|
(Optional) Run in quiet mode. In this case, edktool removes all descriptive messages from the output and shows the XML matchlist only (that is, an XML document with all the matches and any configured metadata).
|
-r <redaction_file>
|
(Optional) A copy of the input file, with all matches redacted. For example, if you specified an IDX input file, the content is sent to the redaction file as follows, with the redactions made in place: #DREREFERENCE 1 #DRECONTENT The driver ########## was questioned. #DREENDDOC |
-b
|
Set this parameter to read the input file in binary mode, rather than text mode. If you create a grammar file that matches entities with only Windows (CR LF) line endings and you run edktool on Windows, edktool must read the input file in binary mode for it to find any matches. Micro Focus recommends that you create grammar files capable of handling both Windows and Unix line endings. |
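The line-ending recommendation in the -b description can be illustrated outside of Eduction. The following is a minimal Python sketch, not Eduction grammar syntax: Python regular expressions stand in for grammar patterns, and the sample text, pattern names, and the first_match helper are all hypothetical. It shows why a pattern that accepts only CR LF finds nothing once the line endings are plain LF, while a pattern with an optional CR matches both forms.

```python
import re

# Hypothetical sample data: the same record with Windows and Unix line endings.
windows_text = b"Name: Joe Bloggs\r\nRole: driver\r\n"
unix_text = b"Name: Joe Bloggs\nRole: driver\n"

# Stand-in for a pattern that matches only Windows (CR LF) line endings.
crlf_only = re.compile(rb"Name: (.+?)\r\nRole")

# Stand-in for a pattern that tolerates both endings: the CR is optional.
either_ending = re.compile(rb"Name: (.+?)\r?\nRole")

def first_match(pattern, data):
    """Return the first captured name as text, or None if nothing matches."""
    m = pattern.search(data)
    return m.group(1).decode("utf-8") if m else None

print(first_match(crlf_only, windows_text))
print(first_match(crlf_only, unix_text))
print(first_match(either_ending, windows_text))
print(first_match(either_ending, unix_text))
```

Only the second call finds no match. A pattern written in the style of either_ending does not depend on whether the input was read in binary or text mode.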
The extract option requires an input file (in IDOL IDX, IDOL XML, or plain text format) and either a configuration file or a grammar file. If you do not provide a configuration file, edktool searches the file for any specified entities in the specified grammar (or all entities, if none are specified). For example, in the simplest command line:
C:\>edktool e -i myData.txt -g grammar1.ecr,grammar2.ecr
This command invokes edktool without a configuration file. It uses the command-line arguments to process the data file myData.txt with the grammar files grammar1.ecr and grammar2.ecr. Eduction identifies all the entities in the two grammar files, and matches on these. The output is sent to the console in XML format, identifying matches in the data file and using the entity names to generate field names for the matches that contain the matched data. Assuming myData.txt is a plain text file, the entire body of the file is matched.
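If you call edktool from a script, it can help to assemble the argument list in one place. The following Python sketch is one possible way to do that, using only the parameters described in the table above. The function name build_extract_command and its keyword arguments are hypothetical, and the sketch assumes that edktool is on the PATH; it only builds the argument list and does not run the tool.

```python
def build_extract_command(input_file, grammars=(), entities=(), config=None,
                          output=None, match_output=False, quiet=False,
                          redaction_file=None, binary=False, exe="edktool"):
    """Return the argument list for an 'edktool e' (extract) invocation."""
    # The extract option needs either a configuration file or a grammar file.
    if not config and not grammars:
        raise ValueError("provide a configuration file (-c) or a grammar file (-g)")

    cmd = [exe, "e", "-i", input_file]
    if config:
        cmd += ["-c", config]
    if grammars:
        # Multiple grammar files are comma-separated, as in the example above.
        cmd += ["-g", ",".join(grammars)]
    if entities:
        # Multiple entities are separated with a comma.
        cmd += ["-e", ",".join(entities)]
    if output:
        cmd += ["-o", output]
    if match_output:
        cmd.append("-m")
    if quiet:
        cmd.append("-q")
    if redaction_file:
        cmd += ["-r", redaction_file]
    if binary:
        cmd.append("-b")
    return cmd


cmd = build_extract_command("myData.txt", grammars=["grammar1.ecr", "grammar2.ecr"])
print(" ".join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) if you want to run the command from Python.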
You can enable redaction on extracted matches in edktool either by setting RedactedOutput to True in the edktool configuration file, or by specifying a redaction file using the -r parameter at the command line.
NOTE: Edktool only performs redaction on fields that you have configured as IDOL search fields.
If you have specified an IDX file to perform extraction on, existing fields are preserved in their unredacted form, and a redacted copy of each search field is added to the IDX file, with _REDACTED appended to the original field name. For example:
#DREREFERENCE 1
#DREFIELD DRECONTENT_REDACTED="The driver ########## was questioned."
#DRECONTENT
The driver Joe Bloggs was questioned.
#DREENDDOC
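The shape of this IDX output can be sketched in a few lines of Python. This is an illustration of the naming convention only, not edktool's implementation: the helper names redacted_field_name and idx_with_redacted_copy are hypothetical, and the sketch handles a single DRECONTENT field whose redacted text is already known.

```python
def redacted_field_name(field_name):
    """Name of the redacted copy: the original field name plus _REDACTED."""
    return field_name + "_REDACTED"


def idx_with_redacted_copy(reference, content, redacted_content):
    """Build an IDX document that keeps the original content unredacted
    and adds a redacted copy as an extra field."""
    lines = [
        f"#DREREFERENCE {reference}",
        f'#DREFIELD {redacted_field_name("DRECONTENT")}="{redacted_content}"',
        "#DRECONTENT",
        content,
        "#DREENDDOC",
    ]
    return "\n".join(lines)


print(idx_with_redacted_copy(
    1,
    "The driver Joe Bloggs was questioned.",
    "The driver ########## was questioned.",
))
```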
If you have specified a plain text file to perform extraction on, the entities identified as matches by edktool are redacted from the input text to form the redacted output. For example:
Input:
The driver Joe Bloggs was questioned.
Output:
The driver ########## was questioned.
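The plain text example is consistent with replacing every character of a match with a # character. The Python sketch below reproduces that behavior under this assumption; the redact function and its mask parameter are hypothetical, the list of matches is supplied by hand rather than produced by Eduction, and edktool's actual redaction logic may differ.

```python
def redact(text, matches, mask="#"):
    """Replace each matched string in text with a same-length run of mask characters."""
    # Handle longer matches first so that a match contained in a longer one
    # does not leave the longer match partially unredacted.
    for match in sorted(matches, key=len, reverse=True):
        text = text.replace(match, mask * len(match))
    return text


print(redact("The driver Joe Bloggs was questioned.", ["Joe Bloggs"]))
```

Because the mask has the same length as the match, the redacted text keeps the same length as the input.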
Eduction sends redacted output to the file specified in the -r parameter. If you do not specify this argument but you have enabled redaction in the configuration file, Eduction displays redacted output in the console after the list of matches, unless you have specified the -q parameter at the command line to enable quiet mode. In quiet mode, edktool does not display redacted output in the console.
edktool e -i myPlainTextFile.txt -g myGrammar.ecr
Extracts all entities in myGrammar.ecr from myPlainTextFile.txt, sending the output to the console in XML format, with the field names for the matching text automatically generated from the entity names found in myGrammar.ecr.
edktool e -i myIDOLfile.idx -c myIDOLConfigFile.cfg -o myoutputfile.idx
Uses the configuration file myIDOLConfigFile.cfg to extract entities from the file myIDOLfile.idx, and directs the output, with additional Eduction fields, to the file myoutputfile.idx.
edktool e -i myIDOLfile.idx -c myIDOLConfigFile.cfg -o myoutputfile.xml -m
The same as the previous example, except that it outputs the match results to an edktool XML file.