Tag Documents into Clusters

After you index documents, you can tag them into clusters of similar documents. This is useful, for example, for grouping duplicate or near-duplicate documents together.

To tag documents, send the DRETAGDOCCLUSTERS index action. For information about the parameters and options available, refer to the Content Component Help.

DRETAGDOCCLUSTERS Example

The Content component indexes three documents:

#DREREFERENCE A
#DREDBNAME Default
#DREFIELD CHECKSUM="ABCD1234"
#DRECONTENT
apple banana cheese
#DREENDDOC

#DREREFERENCE B
#DREDBNAME Default
#DREFIELD CHECKSUM="ABCD1234"
#DRECONTENT
apple banana cheese
#DREENDDOC

#DREREFERENCE C
#DREDBNAME Default
#DREFIELD CHECKSUM="XYZ9876"
#DRECONTENT
apple banana data
#DREENDDOC
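
As an illustration only, the three records above could be generated with a small helper such as the following Python sketch. The function name to_idx and its arguments are hypothetical and are not part of any Content component API. The sketch renders only the lines shown in this example and does not cover anything else that a complete IDX file might need.

```python
def to_idx(reference: str, db_name: str, fields: dict, content: str) -> str:
    """Render one document in the IDX layout used in this example.

    Hypothetical helper for illustration; it emits only the line types
    that appear in the records above.
    """
    lines = [f"#DREREFERENCE {reference}", f"#DREDBNAME {db_name}"]
    # One #DREFIELD line per metadata field, in NAME="value" form.
    lines += [f'#DREFIELD {name}="{value}"' for name, value in fields.items()]
    lines += ["#DRECONTENT", content, "#DREENDDOC"]
    return "\n".join(lines)


# The three example documents: (reference, checksum, content).
examples = [
    ("A", "ABCD1234", "apple banana cheese"),
    ("B", "ABCD1234", "apple banana cheese"),
    ("C", "XYZ9876", "apple banana data"),
]

idx_text = "\n\n".join(
    to_idx(ref, "Default", {"CHECKSUM": checksum}, content)
    for ref, checksum, content in examples
)
print(idx_text)
```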

After indexing, you send the following action:

[...]/DRETAGDOCCLUSTERS?TagField=DOCUMENT/CLUSTERID&MinScore=60&TagSourceField=DOCUMENT/DREREFERENCE&MinID=1&MaxID=3&CheckSumField=DOCUMENT/CHECKSUM&TaggedDBName=tagged&RelevanceField=DOCUMENT/CLUSTERSCORE
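
If you build the action programmatically, the query string can be assembled from the same parameters. The following Python sketch uses only the standard library. The function name and the base URL are assumptions made for illustration; substitute the host and index port of your own Content component for the elided [...] part.

```python
from urllib.parse import urlencode


def build_tag_clusters_action(base_url: str) -> str:
    """Return the DRETAGDOCCLUSTERS action URL from the example above.

    base_url stands in for the elided "[...]" part of the action and
    must be supplied by the caller.
    """
    params = {
        "TagField": "DOCUMENT/CLUSTERID",
        "MinScore": 60,
        "TagSourceField": "DOCUMENT/DREREFERENCE",
        "MinID": 1,
        "MaxID": 3,
        "CheckSumField": "DOCUMENT/CHECKSUM",
        "TaggedDBName": "tagged",
        "RelevanceField": "DOCUMENT/CLUSTERSCORE",
    }
    # safe="/" leaves the slashes in the field paths unescaped, so the
    # result reads the same as the action shown above.
    return f"{base_url}/DRETAGDOCCLUSTERS?{urlencode(params, safe='/')}"


# Hypothetical host and port, for illustration only.
print(build_tag_clusters_action("http://localhost:9001"))
```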

The Content component modifies the data:

#DREREFERENCE A
#DREDBNAME Tagged
#DREFIELD CHECKSUM="ABCD1234"
#DREFIELD CLUSTERID="A"
#DREFIELD CLUSTERSCORE="100.00"
#DRECONTENT
apple banana cheese
#DREENDDOC

#DREREFERENCE B
#DREDBNAME Tagged
#DREFIELD CHECKSUM="ABCD1234"
#DREFIELD CLUSTERID="A"
#DREFIELD CLUSTERSCORE="100.00"
#DRECONTENT
apple banana cheese
#DREENDDOC

#DREREFERENCE C
#DREDBNAME Tagged
#DREFIELD CHECKSUM="XYZ9876"
#DREFIELD CLUSTERID="A"
#DREFIELD CLUSTERSCORE="70.00"
#DRECONTENT
apple banana data
#DREENDDOC

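Document A is tagged with the cluster ID A because it does not match any existing cluster, so it starts a new cluster. The cluster ID is taken from the reference of A, which is the field that TagSourceField specifies.

Document B is tagged with the cluster ID A because the value of its CHECKSUM field matches the value in A.

Document C is tagged with the cluster ID A because it is similar to A, with a score (70) that is higher than the specified MinScore (60).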
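
The three tagging decisions in this example can be modeled with a short Python sketch. This is a toy model of the rules described here, not the algorithm that the Content component uses. All names are hypothetical, the comparison against only the first document of each cluster is an assumption, and the term-overlap score is a stand-in, so the score it produces for C differs from the 70.00 that the Content component reports.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Doc:
    reference: str
    checksum: str
    content: str
    cluster_id: Optional[str] = None
    cluster_score: Optional[float] = None


def term_overlap_score(doc: Doc, head: Doc) -> float:
    """Stand-in similarity: percentage of doc's terms that also occur in head."""
    doc_terms = set(doc.content.split())
    head_terms = set(head.content.split())
    if not doc_terms:
        return 0.0
    return 100.0 * len(doc_terms & head_terms) / len(doc_terms)


def tag_clusters(docs: List[Doc], min_score: float,
                 score: Callable[[Doc, Doc], float]) -> List[Doc]:
    heads: List[Doc] = []  # first document of each cluster (assumption)
    for doc in docs:
        match = None
        # Rule 1: an identical checksum joins the cluster with a score of 100.
        for head in heads:
            if head.checksum == doc.checksum:
                match = (head, 100.0)
                break
        # Rule 2: otherwise join the most similar cluster above min_score.
        if match is None:
            for head in heads:
                s = score(doc, head)
                if s > min_score and (match is None or s > match[1]):
                    match = (head, s)
        if match is None:
            # Rule 3: no match, so start a new cluster named after the
            # document's reference (the TagSourceField in the example).
            heads.append(doc)
            doc.cluster_id, doc.cluster_score = doc.reference, 100.0
        else:
            doc.cluster_id, doc.cluster_score = match[0].cluster_id, match[1]
    return docs


docs = [
    Doc("A", "ABCD1234", "apple banana cheese"),
    Doc("B", "ABCD1234", "apple banana cheese"),
    Doc("C", "XYZ9876", "apple banana data"),
]
for d in tag_clusters(docs, min_score=60, score=term_overlap_score):
    print(d.reference, d.cluster_id, f"{d.cluster_score:.2f}")
```

With a higher threshold in this toy model, such as min_score=70, document C no longer qualifies for cluster A and starts its own cluster instead.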