[RFC] new plugin with normalizer & analyzer for phone numbers #11326

Closed
@rursprung

Description

UPDATE: RFC for new plugin
please use this issue as an RFC for a new plugin under the opensearch-project org which provides a normalizer & analyzer for phone numbers. i have implemented & open-sourced the plugin (it needs minor polishing of the git history & a port to 3.x; this can be done in a few minutes once we know where it'll live) and would very much like to see it hosted & owned by the project, as i believe it is generally useful.
see this comment for more details: #11326 (comment)

original post:

Is your feature request related to a problem? Please describe.
we have a use-case where we store (amongst other things) a phone number in a dedicated field of the document. the number is ingested from another system, where it was entered by users; there's some validation there, but the way the number is written can still vary. a user can then trigger a search which (amongst other things) tries to match the phone number. since the search text is also entered by a user, the phone number might come in any format: with or without an international calling prefix; with the prefix written as + or 00 (or the national equivalent thereof); with or without separators (whitespace, dashes, dots: you pick a character and chances are that some country uses it); with or without brackets for grouping digits; etc.

as a corner case (doesn't really affect us, but relevant for a general solution): even just filtering for digits doesn't work if a number is written with letters standing in for digits. the only one i actually know is 1-800-MICROSOFT in the USA, but i think you have lots of these over there?
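to illustrate the corner case, here is a minimal Python sketch (not part of the plugin) of how letters could be mapped to digits using the standard telephone keypad layout. the function name `vanity_to_digits` and the sample number `1-800-EXAMPLE` are made up for illustration; a real solution would also need to decide what to do with letters that are not meant as digits.

```python
# Letter groups per keypad digit (standard telephone keypad layout, ITU-T E.161).
KEYPAD = {
    "2": "ABC", "3": "DEF", "4": "GHI", "5": "JKL",
    "6": "MNO", "7": "PQRS", "8": "TUV", "9": "WXYZ",
}

# Translation table covering both upper- and lowercase letters.
_LETTER_TO_DIGIT = str.maketrans({
    letter: digit
    for digit, letters in KEYPAD.items()
    for letter in letters + letters.lower()
})


def vanity_to_digits(number: str) -> str:
    """Replace keypad letters with their digits; leave everything else as-is.

    Hypothetical helper for illustration only: separators, '+' and digits
    pass through unchanged.
    """
    return number.translate(_LETTER_TO_DIGIT)


if __name__ == "__main__":
    # "1-800-EXAMPLE" is an invented number, not a real one.
    print(vanity_to_digits("1-800-EXAMPLE"))
```

this only handles the letter-to-digit step; separators and prefixes would still need to be normalized separately.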

Describe the solution you'd like
it'd be great if OpenSearch could ship with a normalizer (or even a dedicated field type which automatically uses this normalizer?) for phone numbers which would cover most (if not all) cases. it could start with the most common ones and then be improved over time by the community when the need arises.

Describe alternatives you've considered
everyone can build their own normalizer for phone numbers. the problem is that none of them will cover all (or even most) phone numbers and this just creates additional effort if everyone needs to re-invent the wheel.

the following is a very basic implementation which however doesn't cover most of the cases listed above (hence why it's hard to build a good one on your own):

{
  "analysis": {
    "char_filter": {
      "whitespace_remove": {
        "type": "pattern_replace",
        "pattern": "\\s",
        "replacement": ""
      },
      "transform_plus_to_00": {
        "type": "pattern_replace",
        "pattern": "\\+",
        "replacement": "00"
      }
    },
    "normalizer": {
      "phone_number_normalizer": {
        "type": "custom",
        "char_filter": [
          "whitespace_remove",
          "transform_plus_to_00"
        ],
        "filter": [
          "lowercase"
        ]
      }
    }
  }
}
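
to make the gaps concrete, the following Python sketch approximates what the two char filters above do to a few spellings of the same number. it is an emulation for illustration, not how OpenSearch evaluates the normalizer: `basic_normalize` is a made-up name, the sample number is invented, Python's `\s` is not guaranteed to match exactly the same characters as the Java regex engine, and the case filter is left out because it has no effect on digits.

```python
import re


def basic_normalize(value: str) -> str:
    """Approximate the two char filters of the normalizer above."""
    value = re.sub(r"\s", "", value)   # whitespace_remove
    value = value.replace("+", "00")   # transform_plus_to_00
    return value


if __name__ == "__main__":
    # Different spellings of one invented number.
    samples = [
        "+41 44 668 18 00",    # international, '+' prefix, spaces
        "0041 44 668 18 00",   # international, '00' prefix, spaces
        "+41-44-668-18-00",    # dashes as separators
        "044 668 18 00",       # national format
        "+41 (0)44 668 18 00", # brackets around the trunk prefix
    ]
    for sample in samples:
        print(f"{sample!r:26} -> {basic_normalize(sample)!r}")
```

only the first two spellings end up with the same normalized value; dashes, brackets and the national format each produce a different string, so an exact match on the normalized field would miss them.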

Additional context
the Wikipedia article on national conventions for writing telephone numbers seems to cover most (if not all?) ways of writing phone numbers.

Metadata

Labels

Roadmap:Search, Search:Relevance, enhancement, v2.18.0, v3.0.0

Projects

Status

✅ Done
