Language code for Northern Sámi #1279

BLKSerene · 2023-08-31T18:31:59Z

Describe the bug
It seems that Stanza uses the 2-letter ISO 639-1 code, if available, as the default language code in the doc and JSON file. Otherwise, the 3-letter ISO 639-3 language code is used.

But for Northern Sámi, sme is used as the default language code even though se is available as the 2-letter ISO 639-1 language code.

Expected behavior
To be consistent with other languages, use se instead of sme as the default language code for Northern Sámi.

Environment (please complete the following information):

OS: Windows 11 x64
Python version: Python 3.10.12
Stanza version: 1.5.0


AngledLuffa · 2023-08-31T18:43:05Z

While I certainly agree that being more consistent is better, the standard we follow is this: for any language which has a UD dataset, we use the abbreviation UD uses as the primary abbreviation. As it turns out, UD used `sme`: sme_giella-ud-train.conllu. Now, we certainly could make it so that the language code gets converted. I would be happy to accept a PR for that, but I'm probably not going to do it myself any time soon.


AngledLuffa · 2023-09-09T22:25:22Z

Was this answer satisfactory? I'm not sure whether I can close this issue or whether it needs more work.

AngledLuffa · 2023-09-09T22:37:01Z

Actually, I think this should be really easy to address - just make the resources file include se as an alias for sme. I'll see if I can make that happen. I can even update the resources file for the newly released 1.5.1 without changing the code

For any language with a default code of 3 letters (as per Universal Dependencies) and an alternate code of 2 letters, we can add that 2-letter code to the resources file as an alias for people who expect it. Currently that only applies to se / sme (that we know of, at least).
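A rough sketch of that rule, for illustration only. The `{"alias": ...}` entry shape, the `lang_name` field, the `add_two_letter_aliases` helper, and the small `THREE_TO_TWO` table are assumptions made for this example, not the actual code that generates the resources file.

```python
# Hypothetical, hand-written excerpt of ISO 639-3 -> ISO 639-1 pairs.
# A real table would cover every language with a 2-letter code.
THREE_TO_TWO = {"sme": "se", "eng": "en", "fra": "fr"}


def add_two_letter_aliases(resources, three_to_two=THREE_TO_TWO):
    """Return a copy of `resources` with alias entries for 2-letter codes.

    An alias is added only when the default code has 3 letters, a 2-letter
    alternate is known, and that alternate is not already a key.
    The {"alias": ...} shape is an assumption for this sketch.
    """
    updated = dict(resources)
    for lang in resources:
        short = three_to_two.get(lang)
        if len(lang) == 3 and short is not None and short not in resources:
            updated[short] = {"alias": lang}
    return updated


# Illustrative stand-in for the resources file, not its real contents.
resources = {
    "en": {"lang_name": "English"},
    "sme": {"lang_name": "North_Sami"},
}
with_aliases = add_two_letter_aliases(resources)
print(with_aliases["se"])
```

Here only `sme` gains an alias: `en` is already a 2-letter default code, and `fra` does not appear in the stand-in resources at all.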

AngledLuffa · 2023-09-10T00:32:34Z

Going forward, we'll regenerate the resources file with an alias which maps se to sme. Therefore, although we intend to keep using the UD abbreviation for North Sami, if you request a pipeline for se, it should successfully build one for you.
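To illustrate how such a mapping might be consumed on lookup, here is a minimal sketch. The `resolve_lang` function and the `{"alias": ...}` entry shape are assumptions for this example and are not taken from the Stanza source.

```python
def resolve_lang(resources, lang):
    """Map a requested language code to the code the models are stored under.

    If the entry for `lang` is an alias, follow it once; otherwise return
    `lang` unchanged. Unknown codes raise a KeyError.
    """
    entry = resources.get(lang)
    if entry is None:
        raise KeyError(f"Unsupported language code: {lang}")
    return entry.get("alias", lang)


# Illustrative stand-in for the regenerated resources file.
example_resources = {
    "sme": {"lang_name": "North_Sami"},
    "se": {"alias": "sme"},
}

requested = resolve_lang(example_resources, "se")
direct = resolve_lang(example_resources, "sme")
print(requested, direct)
```

Under this scheme both `se` and `sme` end up at the same models, while `sme` stays the primary code.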

BLKSerene added the bug label Aug 31, 2023

AngledLuffa added a commit that referenced this issue Sep 9, 2023

Use an alias for se / sme, as per #1279

b811aa8

AngledLuffa added a commit that referenced this issue Sep 9, 2023

Use an alias for se / sme, as per #1279

570a58c

AngledLuffa added a commit to stanfordnlp/stanza-resources that referenced this issue Sep 10, 2023

Add an alias for se -> sme, such as might answer stanfordnlp/stanza#1279

4bcbae6

AngledLuffa added a commit that referenced this issue Sep 10, 2023

Use an alias for se / sme, as per #1279

147eb4a

AngledLuffa closed this as completed Sep 10, 2023
