[open] Character handling flaw

Rhyono · 02/12/19, 12:57 AM

I haven't figured out what is special about it yet, but the "Latin small letter A with grave" (this guy: à) works fine in all text fields:

Lua Code:

/script d('àààààààààààààààààààààààààà')

However, the second half of it gets treated as a space character, which results in the code below splitting the string:

Lua Code:

/script inStr = "1à1";for outStr in inStr:gmatch("%S+") do d(outStr) end

The Lua website's demo, however, does not have this problem:

Lua Code:

inStr = "1à1"
for outStr in inStr:gmatch("%S+") do
    print(outStr)
end

I'm not sure how customized your Lua implementation is, but do you fix issues like these?

sirinsidiator · 02/12/19, 06:04 AM

This is not a bug, but simply an encoding issue. The Lua string functions assume your input sequence is ASCII, but you used UTF8 for your .lua file.
This means the à character in your test corresponds to the two byte sequence "c3 a0" instead of "e0". According to https://www.ascii-code.com/ "c3" is "Latin capital letter A with tilde" and "a0" "Non-breaking space". The game font cannot properly render the first one since it uses utf8 instead of ASCII, so it shows a box instead and the space is handled by gmatch.
Try to convert your .lua file to ASCII and it should work as expected although it will break any "real" UTF8 strings you use and the letter will be rendered as a box unless you use a custom font.

votan · 02/13/19, 03:08 AM

Welcome to the hell of localization.

à is at least part of the extended ASCII code. But think about Russian or Japanese players.

http://lua-users.org/wiki/LuaUnicode

merlight · 02/14/19, 06:56 AM

Originally Posted by sirinsidiator

This is not a bug, but simply an encoding issue.

If this is not a bug, it's a terrible feature.

TL;DR: When you have all strings in the game in UTF-8, your string handling functions should not operate in LATIN-1.

Originally Posted by sirinsidiator

The Lua string functions assume your input sequence is ASCII, but you used UTF8 for your .lua file.

Lua, as in pure Lua 5.1 interpreter, most likely works with C locale, which means string functions only work on individual bytes and assume ASCII and don't care about byte values 128 and above. Because of this, string.find("\195\160", "%s") returns nil ... neither 195 nor 160 represent any character in ASCII, and so cannot match "%s" (space).
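For anyone who wants to check this outside the game, here is a minimal sketch. It assumes a stock Lua interpreter running in the default C locale; the variable names are just for illustration, and `print` stands in for the game's `d`.

```lua
-- "à" encoded as UTF-8 is the two bytes 0xC3 0xA0 (decimal 195, 160).
local aGrave = "\195\160"

-- In stock Lua with the C locale, isspace() is false for bytes >= 128,
-- so %s should not match either byte of the character.
local pos = string.find(aGrave, "%s")

-- Consequently, splitting on %S+ keeps "1à1" in one piece.
local parts = {}
for token in ("1" .. aGrave .. "1"):gmatch("%S+") do
    parts[#parts + 1] = token
end

print(pos, #parts)
```

Running the same lines through `/script` in the game (with `d` instead of `print`) is a quick way to compare the two interpreters.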

Enter ESOLua, modified interpreter. Despite the fact that strings in the ESO API are, for obvious reasons, UTF-8 encoded, string matching functions treat strings as LATIN-1 encoded. Therefore, string.find("\195\160", "%s") returns 2, matching the trailing byte of this two-byte character (in LATIN-1, 160 is a space character). This is BOLLOCKS.

Originally Posted by sirinsidiator

This means the à character in your test corresponds to the two byte sequence "c3 a0" instead of "e0". According to https://www.ascii-code.com/ "c3" is "Latin capital letter A with tilde" and "a0" "Non-breaking space".

ASCII is a 7-bit encoding. The described meaning of "c3" and "a0" comes from LATIN-1, which is a superset of ASCII, but not ASCII.

Originally Posted by sirinsidiator

The game font cannot properly render the first one since it uses utf8 instead of ASCII, so it shows a box

It's not a font issue, it's because "c3" is not a valid UTF-8 sequence. I don't know why the OP's client renders tofu, mine didn't render anything, but either way "c3" with no trailing byte in UTF-8 sequence is an error, not a character.

Originally Posted by sirinsidiator

and the space is handled by gmatch.

And that's the problem. It's a space only for gmatch assuming wrong encoding, for everyone else it's the second byte of "à".

Originally Posted by sirinsidiator

Try to convert your .lua file to ASCII and it should work as expected although it will break any "real" UTF8 strings you use and the letter will be rendered as a box unless you use a custom font.

I find this advice confusing. Converting Lua source to ASCII means replacing all non-ASCII characters with "\123" escapes (UTF-8-encoded, of course). Which would be tedious and wouldn't solve the OP's issue. Because "\195\160" == "à", the matching function will see the same bytes as before.
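To make that concrete, a small sketch that dumps the bytes the matching function actually receives. `toHex` is a hypothetical helper written for this post, not part of any API, and the comparison on the last line assumes the file is saved as UTF-8.

```lua
-- Hypothetical helper: render every byte of a string as two hex digits.
local function toHex(s)
    -- "." matches one byte at a time; the outer parentheses drop
    -- gsub's second return value (the substitution count).
    return (s:gsub(".", function(ch)
        return string.format("%02x ", ch:byte())
    end))
end

local escaped = "\195\160"

print(toHex(escaped))   -- the bytes of the escaped form
print(toHex("à"))       -- the bytes of the literal, as stored in this file
print(escaped == "à")   -- true only if this file is saved as UTF-8
```

Whether the character is typed literally or escaped, the string handed to gmatch contains the same two bytes, so the escape form alone cannot change the result.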

sirinsidiator · 02/14/19, 09:32 AM

I admit I may not have been completely correct about everything I wrote, but the point still stands that it is not a bug, but just wrong assumptions being made.

Since the pattern classes do not support unicode, one would need to use the appropriate replacements in order to get the expected output:

Lua Code:

local inStr = "1à1"
for outStr in inStr:gmatch("[^\t-\r ]+") do
    d(outStr)
end

I am not sure if it would be a good idea to change the string library so it supports unicode, but doesn't follow the Lua documentation on the web anymore. Maybe they should instead add luautf8? That way we'd have a unicode enabled string library. They already added the utf8 module from Lua 5.3 after I requested it a while ago.

Originally Posted by merlight

I find this advice confusing. Converting Lua source to ASCII means replacing all non-ASCII characters with "\123" escapes (UTF-8-encoded, of course). Which would be tedious and wouldn't solve the OP's issue. Because "\195\160" == "à", the matching function will see the same bytes as before.

Guess I was not clear about that. I was referring to Notepad++'s convert feature. It attempts to exchange the byte sequences with the appropriate LATIN-1 code. Of course it won't solve anything, but it would demonstrate that the code would work if you used the expected encoding for the input.

merlight · 02/15/19, 07:34 AM

Originally Posted by sirinsidiator

I admit I may not have been completely correct about everything I wrote, but the point still stands that it is not a bug, but just wrong assumptions being made.

Yes, but wrong assumption on the side of ESOLua's string matching implementation -- the assumption being that input string is LATIN-1 encoded. The game works with UTF-8. To me that qualifies as a bug.

Originally Posted by sirinsidiator

Since the pattern classes do not support unicode, one would need to use the appropriate replacements in order to get the expected output:

Lua Code:

local inStr = "1à1"
for outStr in inStr:gmatch("[^\t-\r ]+") do
    d(outStr)
end

There are different ways of not supporting Unicode. The C library does not support Unicode, yet it's safe to use with UTF-8 strings, because it doesn't make assumptions about bytes beyond ASCII 0-127 range. ESOLua string matching deliberately assumes an 8-bit encoding, which hinders its usability with UTF-8 input. All the special character classes like %s, %w, %u are useless, because they match arbitrary bytes inside multi-byte UTF-8 characters, because they happen to match the class in LATIN-1. So "à" matches "%u%s" (uppercase letter, then space).
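A rough diagnostic sketch for seeing which classes a given interpreter assigns to a single byte. `classesOfByte` is a made-up helper name, and the list of classes is just a sample. In stock Lua with the C locale I would expect bytes 195 and 160 to match none of them; the claim above is that ESOLua reports an uppercase letter and a space.

```lua
-- Hypothetical helper: list the pattern classes that a single byte matches.
local function classesOfByte(byte)
    local ch = string.char(byte)
    local hits = {}
    for _, class in ipairs({ "%a", "%l", "%u", "%s", "%p", "%c" }) do
        if ch:find(class) then
            hits[#hits + 1] = class
        end
    end
    return table.concat(hits, " ")
end

-- The two bytes of UTF-8 "à", followed by two plain ASCII bytes for comparison.
print("195:", classesOfByte(195))
print("160:", classesOfByte(160))
print(" 65:", classesOfByte(65))   -- "A"
print(" 32:", classesOfByte(32))   -- space
```

In the game, swap `print` for `d` to run it from an addon or `/script`.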

Originally Posted by sirinsidiator

I am not sure if it would be a good idea to change the string library so it supports unicode, but doesn't follow the Lua documentation on the web anymore. Maybe they should instead add luautf8? That way we'd have a unicode enabled string library.

ESOLua's string.lower and string.upper have been replaced by UTF-8-aware implementations. It makes zero sense that these functions assume UTF-8, while string.match et al. assume LATIN-1 encoding. I'm not even asking for full Unicode support (although luautf8 would be nice). A good start would be if all functions in the string module made sane assumptions: if one can handle UTF-8, go for it, otherwise stick with ASCII, i.e. don't assume LATIN-1 (or any other encoding that is not a subset of UTF-8).

sirinsidiator · 02/15/19, 10:39 AM

You changed my mind. They only added UTF-8 support after Chip became our new overlord, so it's likely a remnant from before that time. Would certainly be nice if they could make it so everything works consistently.

ZOS_ChipHilseberg · 02/18/19, 10:01 AM

Tell me if this is correct. You are requesting that we replace the lua string.find with our own UTF-8 compatible pattern matching?

sirinsidiator · 02/19/19, 03:13 PM

Originally Posted by ZOS_ChipHilseberg

Tell me if this is correct. You are requesting that we replace the lua string.find with our own UTF-8 compatible pattern matching?

Right. The pattern matching for all string functions that use patterns should be able to correctly match UTF-8 characters. From what I see in the Lua source code, it uses standard C functions (isalpha, isspace, and so on) to do the matching:

Code:

static int match_class (int c, int cl) {
  int res;
  switch (tolower(cl)) {
    case 'a' : res = isalpha(c); break;
    case 'c' : res = iscntrl(c); break;
    case 'd' : res = isdigit(c); break;
    case 'l' : res = islower(c); break;
    case 'p' : res = ispunct(c); break;
    case 's' : res = isspace(c); break;
    case 'u' : res = isupper(c); break;
    case 'w' : res = isalnum(c); break;
    case 'x' : res = isxdigit(c); break;
    case 'z' : res = (c == 0); break;
    default: return (cl == c);
  }
  return (islower(cl) ? res : !res);
}

These would have to be replaced with other functions that support UTF-8, like the ones in luautf8.
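Until something like that exists, a pure-Lua workaround might look like the sketch below. It is not luautf8's API and not how ZOS would implement it, just an illustration of classifying whole characters instead of single bytes. `UTF8_CHAR` is modeled on `utf8.charpattern` from Lua 5.3 (minus the NUL byte), `splitUtf8` is a made-up name, and I am assuming that explicit byte ranges inside `[...]` are compared by raw byte value in ESOLua as they are in stock Lua.

```lua
-- One UTF-8 encoded character: a lead byte plus any continuation bytes.
local UTF8_CHAR = "[\1-\127\194-\244][\128-\191]*"

-- Hypothetical helper: split on ASCII whitespace, one character at a time,
-- so a continuation byte such as 0xA0 is never tested on its own.
local function splitUtf8(text)
    local tokens, current = {}, {}
    for ch in text:gmatch(UTF8_CHAR) do
        -- Only single-byte characters can be ASCII whitespace.
        local isSpace = #ch == 1 and ch:find("^[\t-\r ]$") ~= nil
        if isSpace then
            if #current > 0 then
                tokens[#tokens + 1] = table.concat(current)
                current = {}
            end
        else
            current[#current + 1] = ch
        end
    end
    if #current > 0 then
        tokens[#tokens + 1] = table.concat(current)
    end
    return tokens
end

local tokens = splitUtf8("1\195\1601 \195\160b")
print(#tokens, tokens[1], tokens[2])
```

This only knows about ASCII whitespace. Treating Unicode spaces (U+00A0 and friends) as separators would need a lookup table of code points, which is the kind of thing luautf8 already provides.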