Language code for Northern Sámi #1279

Closed
BLKSerene opened this issue Aug 31, 2023 · 4 comments

@BLKSerene
Contributor

Describe the bug
It seems that Stanza uses the 2-letter ISO 639-1 code, if available, as the default language code in the docs and the JSON resources file. Otherwise, the 3-letter ISO 639-3 code is used.

But for Northern Sámi, sme is used as the default language code even though se is available as the 2-letter ISO 639-1 code.

Expected behavior
To be consistent with other languages, use se instead of sme as the default language code for Northern Sámi.

Environment (please complete the following information):

  • OS: Windows 11 x64
  • Python version: Python 3.10.12
  • Stanza version: 1.5.0
BLKSerene added the bug label Aug 31, 2023
@AngledLuffa
Collaborator

AngledLuffa commented Aug 31, 2023 via email

@AngledLuffa
Collaborator

Was this answer satisfactory? I'm not sure whether I can close this issue or whether it needs more work.

@AngledLuffa
Collaborator

Actually, I think this should be really easy to address - just make the resources file include se as an alias for sme. I'll see if I can make that happen. I can even update the resources file for the newly released 1.5.1 without changing the code

AngledLuffa added a commit to stanfordnlp/stanza-resources that referenced this issue Sep 10, 2023
AngledLuffa added a commit that referenced this issue Sep 10, 2023
For any language with a default code of 3 letters (as per universaldependencies), and an alternate code of 2 letters, we can add that langcode to the resources file to make an alias for people who expect the 2 letter code.

Currently that only applies to se / sme (that we know of, at least)
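
The rule in this commit message can be sketched roughly as follows. This is only an illustration, not the actual generation script: the `{"alias": ...}` entry shape, the `TWO_LETTER_ALTERNATES` table, and the `add_aliases` helper are all assumptions made for the example.

```python
# Hypothetical table: 3-letter default code (as used by UD) -> 2-letter alternate.
TWO_LETTER_ALTERNATES = {"sme": "se"}


def add_aliases(resources, alternates):
    """Return a copy of `resources` with an alias entry for each 2-letter code.

    The {"alias": <default code>} shape is an assumption for this sketch.
    """
    updated = dict(resources)
    for default_code, short_code in alternates.items():
        # Only alias languages that are present, and never overwrite a real entry.
        if default_code in updated and short_code not in updated:
            updated[short_code] = {"alias": default_code}
    return updated


# Toy stand-in for the resources file; real entries hold model metadata.
resources = {"sme": {"lang_name": "North_Sami"}}
resources = add_aliases(resources, TWO_LETTER_ALTERNATES)
```

Skipping codes that already have their own entry keeps an alias from shadowing a language that legitimately uses the 2-letter code as its default.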
@AngledLuffa
Collaborator

Going forward, we'll regenerate the resources file with an alias which maps se to sme. Therefore, although we intend to keep using the UD abbreviation for North Sami, if you request a pipeline for se, it should successfully build one for you.
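
As a rough illustration of how such an alias could be resolved at lookup time, here is a minimal sketch. The `RESOURCES` dictionary, the `{"alias": ...}` entry shape, and the `resolve_lang` helper are assumptions for the example and may not match Stanza's internals.

```python
# Toy stand-in for the resources file; the entry shapes are assumed.
RESOURCES = {
    "sme": {"lang_name": "North_Sami"},
    "se": {"alias": "sme"},
}


def resolve_lang(resources, lang):
    """Follow alias entries until a non-alias language code is reached."""
    seen = set()
    while "alias" in resources.get(lang, {}):
        if lang in seen:
            # Guard against a malformed file where aliases point at each other.
            raise ValueError(f"alias cycle detected at {lang!r}")
        seen.add(lang)
        lang = resources[lang]["alias"]
    return lang


canonical = resolve_lang(RESOURCES, "se")
```

Under these assumptions, a request for se and a request for sme end up at the same entry, which matches the behavior described above for building a pipeline with se.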
