Commit f59ccd8

Add some more TLD to the tokenization RE (some of which actually get country code TLD after them as well) #1423
1 parent 4421213 commit f59ccd8

1 file changed: +1 -1 lines changed

stanza/models/tokenization/utils.py

Lines changed: 1 addition & 1 deletion
@@ -195,7 +195,7 @@ def process_sentence(sentence, mwt_dict=None):
 
 # https://stackoverflow.com/questions/3809401/what-is-a-good-regular-expression-to-match-a-url
 # modification: disallow " as opposed to all ^\s
-URL_RAW_RE = r"""(?:https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s"]{2,}|www\.[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s"]{2,}|https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9]+\.[^\s"]{2,}|www\.[a-zA-Z0-9]+\.[^\s"]{2,})|[a-zA-Z0-9]+\.com(?:\.[^\s"]{2,})?"""
+URL_RAW_RE = r"""(?:https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s"]{2,}|www\.[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s"]{2,}|https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9]+\.[^\s"]{2,}|www\.[a-zA-Z0-9]+\.[^\s"]{2,})|[a-zA-Z0-9]+\.(?:gov|org|edu|net|com|co)(?:\.[^\s"]{2,})"""
 
 MASK_RE = re.compile(f"(?:{EMAIL_RAW_RE}|{URL_RAW_RE})")
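To see what the change does in practice, the following is a minimal sketch that isolates only the final alternative of `URL_RAW_RE` (the part this commit touches) and compares the old and new versions on a few bare domains. The names `OLD_TAIL`, `NEW_TAIL`, and `compare` are hypothetical and are not part of stanza; the sample domains are illustrative only. Note that the optional `?` after the trailing group is absent in the new pattern, so this alternative now requires a second dotted component.

```python
import re

# Hypothetical names: these are only the last alternative of URL_RAW_RE,
# copied from the removed (-) and added (+) lines of the diff.
OLD_TAIL = re.compile(r"""[a-zA-Z0-9]+\.com(?:\.[^\s"]{2,})?""")
NEW_TAIL = re.compile(r"""[a-zA-Z0-9]+\.(?:gov|org|edu|net|com|co)(?:\.[^\s"]{2,})""")


def compare(text):
    """Return (old_matches, new_matches) for a whole-string match."""
    return (
        OLD_TAIL.fullmatch(text) is not None,
        NEW_TAIL.fullmatch(text) is not None,
    )


if __name__ == "__main__":
    samples = [
        "example.com",     # (True, False): trailing group is now required
        "example.com.au",  # (True, True)
        "example.co.uk",   # (False, True)
        "example.gov.uk",  # (False, True)
        "example.org",     # (False, False)
    ]
    for sample in samples:
        print(sample, compare(sample))
```

Under these assumptions, scheme-less domains such as `example.co.uk` and `example.gov.uk` are newly matched, while a bare `example.com` is no longer matched by this alternative. URLs with an `https://` or `www.` prefix go through the earlier alternatives, which this sketch does not exercise.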
