WARNING: Version 6.0 of Elasticsearch has passed its EOL date.
This documentation is no longer being maintained and may be removed. If you are running this version, we strongly advise you to upgrade. For the latest information, see the current release documentation.
Fingerprint Analyzer
The fingerprint analyzer implements a fingerprinting algorithm which is used by the OpenRefine project to assist in clustering.
Input text is lowercased, normalized to remove extended characters, sorted, deduplicated and concatenated into a single token. If a stopword list is configured, stop words will also be removed.
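The steps described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the Lucene implementation: the function names `fold_to_ascii` and `fingerprint` are hypothetical, the regular expression is a crude stand-in for a real tokenizer, and Unicode decomposition only approximates ASCII folding (it handles accented letters such as "ö" but not every extended character).

```python
import re
import unicodedata


def fold_to_ascii(text: str) -> str:
    """Approximate ASCII folding: decompose accented characters, drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def fingerprint(text: str) -> str:
    """Lowercase, fold, tokenize, deduplicate, sort and join into a single token."""
    normalized = fold_to_ascii(text.lower())
    tokens = re.findall(r"\w+", normalized)  # crude stand-in for a real tokenizer
    return " ".join(sorted(set(tokens)))


print(fingerprint("Yes yes, Gödel said this sentence is consistent and."))
```

Because the tokens are deduplicated and sorted, inputs that differ only in case, accents, punctuation or word order collapse to the same fingerprint, which is what makes the output useful as a clustering key.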
Definition
It consists of:
- Tokenizer
  - Standard Tokenizer
- Token Filters (in order)
  - Lower Case Token Filter
  - ASCII Folding Token Filter
  - Stop Token Filter (disabled by default)
  - Fingerprint Token Filter
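The component list above suggests that a roughly equivalent analyzer could be assembled as a custom analyzer. The sketch below builds such a settings body as a Python dictionary and prints it as JSON. It assumes the tokenizer is the built-in `standard` tokenizer; the analyzer name `rebuilt_fingerprint` is hypothetical, and the stop filter is left out because it is disabled by default.

```python
import json

# Hypothetical analyzer name; "standard", "lowercase", "asciifolding" and
# "fingerprint" are built-in Elasticsearch component names.
rebuilt_settings = {
    "settings": {
        "analysis": {
            "analyzer": {
                "rebuilt_fingerprint": {
                    "type": "custom",
                    "tokenizer": "standard",
                    # Filters run in the order listed.
                    "filter": ["lowercase", "asciifolding", "fingerprint"],
                }
            }
        }
    }
}

# Print the body that would be sent when creating an index.
print(json.dumps(rebuilt_settings, indent=2))
```

A body like this would be sent in an index creation request. Building it as a custom analyzer is mainly useful when extra token filters need to be inserted into the chain.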
Example output
POST _analyze
{
  "analyzer": "fingerprint",
  "text": "Yes yes, Gödel said this sentence is consistent and."
}
The above sentence would produce the following single term:
[ and consistent godel is said sentence this yes ]
Configuration
The fingerprint analyzer accepts the following parameters:
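| Parameter | Description |
| separator | The character to use to concatenate the terms. Defaults to a space. |
| max_output_size | The maximum token size to emit. Defaults to 255. |
| stopwords | A pre-defined stop words list like _english_. |
| stopwords_path | The path to a file containing stop words. |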
See the Stop Token Filter for more information about stop word configuration.
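To make the effect of these parameters concrete, here is a minimal Python sketch that works on tokens that have already been lowercased and folded. The function `emit_fingerprint` is hypothetical; its keyword arguments mirror the Elasticsearch settings `separator`, `max_output_size` and `stopwords`. It assumes that a fingerprint longer than the maximum size is discarded rather than truncated, and the three-word stop list is purely illustrative.

```python
from typing import Iterable, Optional


def emit_fingerprint(
    tokens: Iterable[str],
    separator: str = " ",
    max_output_size: int = 255,
    stopwords: Iterable[str] = (),
) -> Optional[str]:
    """Join unique, sorted tokens; return None when the result is too long."""
    stop = set(stopwords)
    kept = sorted({t for t in tokens if t not in stop})
    joined = separator.join(kept)
    # Assumption: an oversized fingerprint is dropped rather than truncated.
    if len(joined) > max_output_size:
        return None
    return joined


tokens = ["yes", "yes", "godel", "said", "this", "sentence", "is", "consistent", "and"]
print(emit_fingerprint(tokens, separator="+"))
print(emit_fingerprint(tokens, stopwords={"and", "is", "this"}))
print(emit_fingerprint(tokens, max_output_size=10))
```

The first call shows a custom separator, the second shows stop word removal, and the third shows an oversized fingerprint producing no token at all.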
Example configuration
In this example, we configure the fingerprint analyzer to use the pre-defined list of English stop words:
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_fingerprint_analyzer": {
          "type": "fingerprint",
          "stopwords": "_english_"
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_fingerprint_analyzer",
  "text": "Yes yes, Gödel said this sentence is consistent and."
}
The above example produces the following term:
[ consistent godel said sentence yes ]