1. Data deduplication
Solr supports data deduplication through the Signature class family. The following Signature implementations are available:
Method | Description
MD5Signature | 128-bit hash used for exact duplicate detection.
Lookup3Signature | 64-bit hash used for exact duplicate detection. Faster than MD5 and smaller to index.
TextProfileSignature | Fuzzy hashing implementation from Nutch for near-duplicate detection. It is tunable and works best on long text fields.
Note:
Adding deduplication changes the allowDups setting so that it applies to an update term (here, the signatureField) rather than to the unique field's term. The signatureField can of course be the unique field itself.
When a document is added, a signature is automatically generated and attached to the document in the specified signatureField.
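To make the idea of "generate a signature and attach it to the document" concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the helper name `add_signature`, the field separator, and the sorted field order are hypothetical choices, and the byte layout is not the one Solr's Signature classes actually use.

```python
import hashlib


def add_signature(doc, fields=None, signature_field="signature"):
    """Return a copy of doc with a hex MD5 signature stored in signature_field.

    Hypothetical helper: it shows the concept only, not Solr's implementation.
    """
    # Use the requested fields, or every field on the document by default.
    names = sorted(fields if fields is not None else doc.keys())
    digest = hashlib.md5()
    for name in names:
        # Never hash the signature field itself; skip fields the doc lacks.
        if name == signature_field or name not in doc:
            continue
        digest.update(name.encode("utf-8"))
        digest.update(b"\x00")
        digest.update(str(doc[name]).encode("utf-8"))
        digest.update(b"\x00")
    signed = dict(doc)
    signed[signature_field] = digest.hexdigest()
    return signed


a = add_signature({"id": "1", "name": "Solr", "cat": "search"}, fields=["name", "cat"])
b = add_signature({"id": "2", "name": "Solr", "cat": "search"}, fields=["name", "cat"])
c = add_signature({"id": "3", "name": "Lucene", "cat": "search"}, fields=["name", "cat"])

print(a["signature"] == b["signature"])  # True: same name and cat
print(a["signature"] == c["signature"])  # False: name differs
```

Documents 1 and 2 differ only in `id`, which is not part of the hashed fields, so they receive the same signature and would be treated as duplicates.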
1.1 Configuration Options
SignatureUpdateProcessorFactory is registered in solrconfig.xml as part of an updateRequestProcessorChain:
<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">id</str>
    <bool name="overwriteDupes">false</bool>
    <str name="fields">name,features,cat</str>
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
  </processor>
  <!-- A chain normally ends with the processors that log and execute the update. -->
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
Setting | Default | Description
signatureClass | org.apache.solr.update.processor.Lookup3Signature | The Signature implementation used to generate the signature hash.
fields | all fields | The fields used to generate the signature hash, as a comma-separated list. By default, all fields on the document are used.
signatureField | signatureField | The name of the field that holds the fingerprint/signature. Make sure this field is defined in schema.xml.
enabled | true | Enables or disables deduplication processing.
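The way these settings interact can be sketched with a toy in-memory index in Python. This is a simplified illustration, not Solr code: `ToyIndex` and `signature` are hypothetical names, and the `overwrite_dupes` flag mirrors the overwriteDupes option from the chain configuration, assuming it means that a new document replaces an existing one with the same signature.

```python
import hashlib


def signature(doc, fields):
    """Hex MD5 over the chosen fields; a stand-in for a Signature class."""
    joined = "\x00".join(f"{name}={doc.get(name, '')}" for name in sorted(fields))
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


class ToyIndex:
    """Hypothetical in-memory index, used only to illustrate the settings."""

    def __init__(self, fields, signature_field="signature",
                 overwrite_dupes=True, enabled=True):
        self.fields = fields
        self.signature_field = signature_field
        self.overwrite_dupes = overwrite_dupes
        self.enabled = enabled
        self.docs = []

    def add(self, doc):
        doc = dict(doc)
        if self.enabled:
            sig = signature(doc, self.fields)
            doc[self.signature_field] = sig
            if self.overwrite_dupes:
                # Drop any earlier document carrying the same signature.
                self.docs = [d for d in self.docs
                             if d.get(self.signature_field) != sig]
        self.docs.append(doc)


index = ToyIndex(fields=["name", "cat"], overwrite_dupes=True)
index.add({"id": "1", "name": "Solr", "cat": "search"})
index.add({"id": "2", "name": "Solr", "cat": "search"})
print(len(index.docs))  # 1: the second document replaced the first
```

With `overwrite_dupes=False` both documents stay in the toy index and simply share the same signature value, and with `enabled=False` no signature is computed at all.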
1.2 In schema.xml
If you use a dedicated field to store the signature, that field must be indexed.
<field name="signature" type="string" stored="true" indexed="true" multiValued="false"/>
Make sure the update handler uses the chain defined above:
<requestHandler name="/update"> <lst name="defaults"> <str name="update.chain">dedupe</str> </lst></requestHandler>
Note:
The update chain can also be selected per request by passing update.chain=dedupe as a request parameter.
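As a small illustration of the per-request form, the Python sketch below only builds the update URL and sends nothing. The host, port, and core name are assumptions for the example, and `update_url` is a hypothetical helper.

```python
from urllib.parse import urlencode


def update_url(base_url, chain="dedupe", commit=True):
    """Build an update URL that selects the given update chain."""
    params = {"update.chain": chain}
    if commit:
        # Ask Solr to commit right after the update.
        params["commit"] = "true"
    return f"{base_url}/update?{urlencode(params)}"


# Assumed local Solr instance with a core named "mycore".
print(update_url("http://localhost:8983/solr/mycore"))
# http://localhost:8983/solr/mycore/update?update.chain=dedupe&commit=true
```

Documents posted to this URL go through the dedupe chain even if the handler's defaults do not name it.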