Language code for Northern Sámi #1279
Comments
While I certainly agree that being more consistent is better, the standard
we follow is this: for any language which has a UD dataset, we use the
abbreviation UD uses as the primary abbreviation. As it turns out, they
used `sme`:
sme_giella-ud-train.conllu
Now, we certainly could make it so that the language code gets
converted. I would be happy to accept a PR for that, but I'm probably not
going to do that myself any time soon.
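A minimal sketch of what such a conversion could look like, assuming a small hand-maintained table from ISO 639-1 codes to the UD-style codes. The names `TWO_LETTER_TO_UD` and `to_primary_code` are hypothetical and are not part of Stanza:

```python
# Hypothetical table: ISO 639-1 code -> UD-style code used as the primary key.
# Only the se / sme pair comes from this thread.
TWO_LETTER_TO_UD = {"se": "sme"}


def to_primary_code(lang: str) -> str:
    """Return the primary code for a user-supplied language code.

    Codes without an entry in the table are returned unchanged
    (after trimming whitespace and lowercasing).
    """
    code = lang.strip().lower()
    return TWO_LETTER_TO_UD.get(code, code)


if __name__ == "__main__":
    print(to_primary_code("se"))
    print(to_primary_code("sme"))
```

With a helper like this, a caller could pass either `se` or `sme` and end up with the same primary code before any model lookup happens.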
…On Thu, Aug 31, 2023 at 1:32 PM Ye Lei (叶磊) ***@***.***> wrote:
*Describe the bug*
It seems that Stanza uses the 2-letter ISO 639-1 code, if available, as the
default language code in the doc
<https://stanfordnlp.github.io/stanza/available_models.html#available-ner-models>
and JSON file
<https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.5.0.json>.
Otherwise, 3-letter ISO 639-3 language codes are used.
But for Northern Sámi, sme is used as the default language code while se
is available as the 2-letter ISO 639-1 language code.
*Expected behavior*
To be consistent with other languages, use se instead of sme as the
default language code for Northern Sámi.
*Environment (please complete the following information):*
- OS: Windows 11 x64
- Python version: Python 3.10.12
- Stanza version: 1.5.0
Was this answer satisfactory? I'm not sure whether we can close this issue or whether it needs more work.
Actually, I think this should be really easy to address - just make the resources file include |
For any language with a default code of 3 letters (as per universaldependencies), and an alternate code of 2 letters, we can add that langcode to the resources file to make an alias for people who expect the 2 letter code. Currently that only applies to se / sme (that we know of, at least)
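The rule above can be sketched as follows. This is only an illustration under stated assumptions: the resources file is modeled as a plain dict keyed by language code, an alias is assumed to be an entry of the form `{"alias": "<primary code>"}`, and the helper names `add_short_aliases` and `resolve_lang` are hypothetical.

```python
def add_short_aliases(resources: dict, short_codes: dict) -> dict:
    """Return a copy of `resources` with 2-letter aliases added.

    `short_codes` maps a 3-letter primary code to its 2-letter alternate.
    An alias is only added when the primary code is present and the
    2-letter code is not already in use.
    """
    updated = dict(resources)
    for primary, short in short_codes.items():
        if len(primary) != 3 or len(short) != 2:
            continue  # the rule only covers 3-letter defaults with 2-letter alternates
        if primary in updated and short not in updated:
            updated[short] = {"alias": primary}  # assumed alias entry shape
    return updated


def resolve_lang(resources: dict, lang: str) -> str:
    """Follow a single alias hop and return the primary language code."""
    entry = resources[lang]  # raises KeyError for unknown codes
    return entry.get("alias", lang)


if __name__ == "__main__":
    # Placeholder entry; the real per-language contents are not modeled here.
    resources = {"sme": {"default_processors": {}}}
    resources = add_short_aliases(resources, {"sme": "se"})
    print(resolve_lang(resources, "se"))
```

Keeping the alias as data in the resources file, rather than as a special case in code, means a newly discovered pair only needs one more entry in `short_codes`.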
Going forward, we'll regenerate the resources file with an alias which maps |
Describe the bug
It seems that Stanza uses the 2-letter ISO 639-1 code, if available, as the default language code in the doc and JSON file. Otherwise, 3-letter ISO 639-3 language codes are used.
But for Northern Sámi, `sme` is used as the default language code while `se` is available as the 2-letter ISO 639-1 language code.
Expected behavior
To be consistent with other languages, use `se` instead of `sme` as the default language code for Northern Sámi.
Environment (please complete the following information):
- OS: Windows 11 x64
- Python version: Python 3.10.12
- Stanza version: 1.5.0