Tag Documents into Clusters
After indexing, you can tag documents into clusters of similar documents. Tagging can be useful for grouping duplicate documents together.
Use the index action DRETAGDOCCLUSTERS. For information about the available parameters and options, refer to the reference documentation for this action.
DRETAGDOCCLUSTERS Example
The Content component indexes three documents:
#DREREFERENCE A
#DREDBNAME Default
#DREFIELD CHECKSUM="ABCD1234"
#DRECONTENT
apple banana cheese
#DREENDDOC

#DREREFERENCE B
#DREDBNAME Default
#DREFIELD CHECKSUM="ABCD1234"
#DRECONTENT
apple banana cheese
#DREENDDOC

#DREREFERENCE C
#DREDBNAME Default
#DREFIELD CHECKSUM="XYZ9876"
#DRECONTENT
apple banana data
#DREENDDOC
After indexing, you send the following action:
[...]/DRETAGDOCCLUSTERS?TagField=DOCUMENT/CLUSTERID&MinScore=60&TagSourceField=DOCUMENT/DREREFERENCE&MinID=1&MaxID=3&CheckSumField=DOCUMENT/CHECKSUM&TaggedDBName=tagged&RelevanceField=DOCUMENT/CLUSTERSCORE
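If you assemble the action from a script, it can help to keep the parameters in one place and encode them together. The following is a minimal Python sketch that rebuilds only the action name and query string from the example above. The host and port portion (shown as [...] in the example) is deliberately left out, and the variable names are illustrative assumptions rather than part of any IDOL client library.

```python
from urllib.parse import urlencode

# Parameters copied from the example action above.
params = {
    "TagField": "DOCUMENT/CLUSTERID",
    "MinScore": 60,
    "TagSourceField": "DOCUMENT/DREREFERENCE",
    "MinID": 1,
    "MaxID": 3,
    "CheckSumField": "DOCUMENT/CHECKSUM",
    "TaggedDBName": "tagged",
    "RelevanceField": "DOCUMENT/CLUSTERSCORE",
}

# safe="/" keeps the slashes in field paths such as DOCUMENT/CLUSTERID unescaped.
action = "DRETAGDOCCLUSTERS?" + urlencode(params, safe="/")
print(action)
```

Prefix the resulting string with the address of your index port before you send it.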
The Content component modifies the data:
#DREREFERENCE A
#DREDBNAME Tagged
#DREFIELD CHECKSUM="ABCD1234"
#DREFIELD CLUSTERID="A"
#DREFIELD CLUSTERSCORE="100.00"
#DRECONTENT
apple banana cheese
#DREENDDOC

#DREREFERENCE B
#DREDBNAME Tagged
#DREFIELD CHECKSUM="ABCD1234"
#DREFIELD CLUSTERID="A"
#DREFIELD CLUSTERSCORE="100.00"
#DRECONTENT
apple banana cheese
#DREENDDOC

#DREREFERENCE C
#DREDBNAME Tagged
#DREFIELD CHECKSUM="XYZ9876"
#DREFIELD CLUSTERID="A"
#DREFIELD CLUSTERSCORE="70.00"
#DRECONTENT
apple banana data
#DREENDDOC
A is tagged as A because it does not match any existing clusters.
B is tagged as A because its CHECKSUM field matches A's.
C is tagged as A because it is similar to A and has a score higher than the specified MinScore (60).
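The three decisions above can be imitated with a small Python sketch. This is an illustration of the decision order only, not the Content component's implementation: the functions tag_clusters and dice_score are hypothetical, and the similarity measure (a Dice coefficient over whitespace-separated terms) is a stand-in for the engine's own relevance scoring, so the score it produces for C differs from the 70.00 shown above.

```python
def dice_score(a, b):
    """Stand-in similarity: Dice coefficient over terms, as a percentage."""
    ta, tb = set(a.split()), set(b.split())
    if not ta and not tb:
        return 100.0
    return 200.0 * len(ta & tb) / (len(ta) + len(tb))


def tag_clusters(docs, min_score):
    """Return {reference: (cluster_id, score)} for each document, in order."""
    seeds = []  # the first document of each cluster
    tags = {}
    for doc in docs:
        match = None
        for seed in seeds:
            # A matching checksum places the document in the cluster directly.
            if doc["checksum"] == seed["checksum"]:
                match = (seed["reference"], 100.0)
                break
            # Otherwise the document must score higher than min_score.
            score = dice_score(doc["content"], seed["content"])
            if score > min_score:
                match = (seed["reference"], round(score, 2))
                break
        if match is None:
            # No existing cluster matches, so the document starts its own.
            seeds.append(doc)
            match = (doc["reference"], 100.0)
        tags[doc["reference"]] = match
    return tags


docs = [
    {"reference": "A", "checksum": "ABCD1234", "content": "apple banana cheese"},
    {"reference": "B", "checksum": "ABCD1234", "content": "apple banana cheese"},
    {"reference": "C", "checksum": "XYZ9876", "content": "apple banana data"},
]
tags = tag_clusters(docs, min_score=60)
for reference, (cluster_id, score) in tags.items():
    print(reference, cluster_id, score)
```

With these inputs, all three documents receive cluster ID A: A starts the cluster, B joins through its checksum, and C joins because its stand-in score is above 60.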