Description
UPDATE: RFC for new plugin
please use this issue as an RFC to have a new plugin under the opensearch-project org (for the phone number normalizer & analyzer). i have implemented & open-sourced the plugin (it needs minor polishing of the git history & a port to 3.x, which can be done in a few minutes once we know where it'll live) and would very much like to see it hosted & owned by the project, as i believe it is generally useful.
see this comment for more details: #11326 (comment)
original post:
Is your feature request related to a problem? Please describe.
we have a use-case where we store (amongst other things) a phone number in a dedicated field of the document. this is ingested from another system, where in turn it has been entered by users (there is some validation, but there might still be some variation in how the number is written). a user can then trigger a search which (amongst other things) will try to match the phone number. since the text to be searched is entered by the user, the phone number might come in any format: with or without international calling prefix, with the calling prefix written as + or 00 (or the national equivalent thereof), with or without separators (whitespaces, dashes, dots, you pick a character and chances are that a country is using it), with or without brackets for grouping numbers together, etc.
as a corner case (doesn't really affect us, but relevant for a general solution): even just filtering for digits doesn't work if a number is entered with an alphabetical representation. the only one i actually know is 1-800-MICROSOFT in the USA, but i think you have lots of these over there?
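to make the problem concrete, here is a minimal python sketch of the "just keep the digits" approach. the helper name `digits_only` and the sample numbers are made up for illustration; they are not part of OpenSearch or of the plugin.

```python
import re


def digits_only(raw: str) -> str:
    """Naive normalization: drop every character that is not a digit."""
    return re.sub(r"\D", "", raw)


# Made-up samples: the same number written three ways, plus a vanity number.
samples = [
    "+41 44 668 18 00",   # international prefix written with +
    "0041 44 668 18 00",  # international prefix written with 00
    "044 668 18 00",      # national format
    "1-800-MICROSOFT",    # alphabetical representation
]

for sample in samples:
    print(f"{sample!r} -> {digits_only(sample)!r}")
```

the first three inputs are meant to be the same number but end up as three different strings, and the vanity number loses everything after the 1-800 part, so exact matching on the result still fails.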
Describe the solution you'd like
it'd be great if OpenSearch could ship with a normalizer (or even a dedicated field type which automatically uses this normalizer?) for phone numbers which would cover most (if not all) cases. it could start with the most common ones and then be improved over time by the community when need arises.
Describe alternatives you've considered
everyone can build their own normalizer for phone numbers. the problem is that none of them will cover all (or even most) phone numbers and this just creates additional effort if everyone needs to re-invent the wheel.
the following is a very basic implementation which however doesn't cover most of the cases listed above (hence why it's hard to build a good one on your own):
{
  "analysis": {
    "char_filter": {
      "whitespace_remove": {
        "type": "pattern_replace",
        "pattern": "\\s",
        "replacement": ""
      },
      "transform_plus_to_00": {
        "type": "pattern_replace",
        "pattern": "\\+",
        "replacement": "00"
      }
    },
    "normalizer": {
      "phone_number_normalizer": {
        "type": "custom",
        "char_filter": [
          "whitespace_remove",
          "transform_plus_to_00"
        ],
        "filter": [
          "lowercase",
          "uppercase"
        ]
      }
    }
  }
}
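as a rough illustration of what this normalizer does and where it stops, the following python sketch approximates its steps outside of OpenSearch. `basic_normalize` is a hypothetical helper, the sample numbers are made up, and python's regex engine is only an approximation of the java patterns used by `pattern_replace`.

```python
import re


def basic_normalize(raw: str) -> str:
    """Approximate the phone_number_normalizer settings above."""
    text = re.sub(r"\s", "", raw)     # char filter: whitespace_remove
    text = re.sub(r"\+", "00", text)  # char filter: transform_plus_to_00
    return text.lower().upper()       # token filters: lowercase, then uppercase


# The + and 00 spellings of the international prefix now agree ...
print(basic_normalize("+41 44 668 18 00"))
print(basic_normalize("0041 44 668 18 00"))

# ... but dashes, brackets and the national format are left untouched.
print(basic_normalize("044-668-18-00"))
print(basic_normalize("+41 (0)44 668 18 00"))
```

only the first two inputs normalize to the same value; other separators, grouping brackets and numbers without an international prefix would each need additional rules, which is why a shared, maintained solution seems preferable.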
Additional context
the wikipedia article on national conventions for writing telephone numbers seems to cover most (if not all?) ways of writing phone numbers