Opened 10 years ago

Closed 10 years ago

Last modified 10 years ago

#9675 closed defect (invalid)

A problem with some UTF8 symbols

Reported by: Aleksey Rechinskiy
Owned by: alex
Priority: high
Milestone: tbd
Component: ShrinkSafe
Version: 1.3.2
Keywords:
Cc:
Blocked By:
Blocking:

Description

Hi. I searched for any notes about UTF-8 support in ShrinkSafe, but found nothing beyond hints that ShrinkSafe _should_ support UTF-8. I wanted to ask about this on the Dojo forums first, but the forums are closed, so I hope there is no harm in writing it here, even if it turns out to be my mistake.

OK. I have used files with UTF-8 strings with ShrinkSafe for a long time (almost a year) and had no problems with them until today, with the simple, common Russian word 'Это' (this).

Here are the test strings:

var str='Это';
var str_1="Это";
var str2='Ещё кириллица';
var str3='Первая проблема с кириллицей за год';

Running ShrinkSafe (bundled in Dojo 1.3.2) on them produces an invalid result for the first two strings:

var str="ё‚о";
var str_1="ё‚о";
var str2="Ещё кириллица";
var str3="Первая проблема с кириллицей за год";

The last two strings are processed correctly. I have never encountered problems like this with ShrinkSafe before...

Here is a hex dump of the start of the original file:

00000000:  76 61 72 20-73 74 72 3D-27 D0 AD D1-82 D0 BE 27  var str='╨н╤В╨╛'
00000010:  3B 0A 76 61-72 20 73 74-72 5F 31 3D-22 D0 AD D1  ;.var str_1="╨н╤
00000020:  82 D0 BE 22-3B 0A 76 61-72 20 73 74-72 32 3D 27  В╨╛";.var str2='

And the file after running ShrinkSafe:

00000000:  76 61 72 20-73 74 72 3D-22 D0 D1 82-D0 BE 22 3B  var str="╨╤В╨╛";
00000010:  0A 76 61 72-20 73 74 72-5F 31 3D 22-D0 D1 82 D0  .var str_1="╨╤В╨
00000020:  BE 22 3B 0A-76 61 72 20-73 74 72 32-3D 22 D0 95  ╛";.var str2="╨Х

Here is the command line:

java -jar shrinksafe.jar %THIS_DIR%\shs_test.js > %THIS_DIR%\shs_test.compiled.js

Is this expected behavior for ShrinkSafe, meaning I shouldn't use it on UTF-8 strings, or is it some sort of bug?

Change History (14)

comment:1 Changed 10 years ago by Aleksey Rechinskiy

dijit/form/nls/ru/validate.js contains the word 'Это' at the start of strings:

({
	invalidMessage: "Указано недопустимое значение.",
	missingMessage: "Это обязательное значение.",
	rangeMessage: "Это значение вне диапазона."
})

These strings are processed correctly during a custom build (but the processing may be different, because the strings are interned into a special _ru.js file in the /nls/ subfolder).

Am I doing something wrong with ShrinkSafe? Any feedback would be appreciated, thanks...

comment:2 Changed 10 years ago by Aleksey Rechinskiy

And by the way, I have just found the similar word "Этот" in my source files that are baked into the custom build. This word is processed correctly.

I have looked at the ShrinkSafe command-line options (the -? switch), but found nothing related to UTF-8 handling.

comment:3 Changed 10 years ago by James Burke

I am not aware of a specific deficiency in utf8 support; my first guess is that perhaps the file was not saved as utf8. Can you confirm the file is truly a utf8 file?

comment:4 Changed 10 years ago by Adam Peller

By default, Shrinksafe picks up the JVM default encoding, IIRC. Our build scripts explicitly override that with UTF-8, load the files into memory, then pass the already-decoded Java strings to Shrinksafe through the API. You might want to find the JVM setting that sets the system encoding and see if setting it to UTF-8 helps.
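
A minimal Java sketch of how the default-encoding behaviour described above could produce exactly the bytes in the reporter's hex dump. It rests on two assumptions that this ticket does not confirm: that the JVM default charset on the reporter's machine was windows-1251, and that the minifier's tokenizer discards Unicode format characters (category Cf). Under windows-1251 the byte AD decodes to U+00AD SOFT HYPHEN, which is a format character, so dropping it and re-encoding loses exactly that one byte. The class and method names are illustrative only.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MisdecodeDemo {

    // Decode the bytes with the given charset, drop Unicode format (Cf)
    // characters, then encode again with the same charset.
    // ASSUMPTION: the minifier drops Cf characters; not confirmed in the ticket.
    static byte[] roundTrip(byte[] input, Charset charset) {
        String decoded = new String(input, charset);
        StringBuilder kept = new StringBuilder();
        for (int i = 0; i < decoded.length(); i++) {
            char c = decoded.charAt(i);
            if (Character.getType(c) != Character.FORMAT) {
                kept.append(c);
            }
        }
        return kept.toString().getBytes(charset);
    }

    // Format bytes as space-separated upper-case hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The word from the ticket, written with escapes so that this source
        // file does not itself depend on the compiler's encoding.
        byte[] original = "\u042D\u0442\u043E".getBytes(StandardCharsets.UTF_8);

        // ASSUMPTION: the reporter's JVM default charset was windows-1251.
        Charset assumedDefault = Charset.forName("windows-1251");

        System.out.println(hex(original));                                    // D0 AD D1 82 D0 BE
        System.out.println(hex(roundTrip(original, assumedDefault)));         // D0 D1 82 D0 BE
        System.out.println(hex(roundTrip(original, StandardCharsets.UTF_8))); // D0 AD D1 82 D0 BE
    }
}
```

If these assumptions hold, this would also explain why only words containing 'Э' (bytes D0 AD) were affected, while the other Cyrillic strings survived the round trip unchanged.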

Also, the forums should have been redirected to dojo-interest.

comment:5 Changed 10 years ago by Aleksey Rechinskiy

Can you confirm the file is truly a utf8 file?

Yes, the file is truly UTF8-encoded.

By default, Shrinksafe picks up the JVM default encoding

Hmm. I'm sorry if I'm wrong, but why does it use the JVM default encoding? The standard encoding for JavaScript files is UTF-8, not the JVM encoding. Shouldn't ShrinkSafe use UTF-8 by default?

You might want to find the JVM setting that sets the system encoding and see if setting it to UTF-8 helps.

Thanks for pointing this out. I'll try to explicitly set the JVM encoding to UTF-8 as soon as I find where this setting is located. I'll post feedback here when I get results...

Also, the forums should have been redirected to dojo-interest.

Is dojo-interest a mailing list? (If anyone cares, I don't feel comfortable using mailing lists; forums are much more convenient. And I know a lot of people who share my position. But that is off-topic here ;) )

OK, I'll try to set UTF-8 for the JVM and post the results here later...

comment:6 in reply to:  4 Changed 10 years ago by Aleksey Rechinskiy

Bingo! I explicitly set the UTF-8 encoding with the following command line, and it works fine now.

java -jar -Dfile.encoding=UTF8 shrinksafe.jar %THIS_DIR%\shs_test.js > %THIS_DIR%\shs_test.compiled.js 
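
To check which encoding a given JVM actually picks up, a small sketch like the following may help. It only prints the `file.encoding` system property and the default charset; `JvmEncodingCheck` is an illustrative name and not part of ShrinkSafe. Note that on JDK 18 and later the default charset is UTF-8 regardless of the OS locale, so the behaviour in this ticket applies to older JVMs.

```java
import java.nio.charset.Charset;

class JvmEncodingCheck {

    // Describe what this JVM considers its default encoding.
    static String describe() {
        return "file.encoding=" + System.getProperty("file.encoding")
                + ", defaultCharset=" + Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Running it once as `java JvmEncodingCheck` and once as `java -Dfile.encoding=UTF8 JvmEncodingCheck` should show whether the override changes anything on a particular machine.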

May I ask someone to document this behavior somewhere in the ShrinkSafe documentation? I think other developers can run into this issue too. If possible, I think it would be much better if ShrinkSafe used UTF-8 by default.

Anyway, thanks a lot for help!

comment:7 Changed 10 years ago by Adam Peller

Resolution: invalid
Status: new → closed

It was documented somewhere, but it must have gotten lost in the shuffle. We still need help with the docs. StackOverflow is probably the right place to resolve things like this.

comment:8 Changed 10 years ago by Aleksey Rechinskiy

OK, I'll post questions like this on StackOverflow first. I'm sorry to bother you.

About "need help with the docs"

I'm really sorry. I love Dojo. I've been working with it since October '08, and I'm still delighted by how it is made. I had just written a reply asking you to accept my little help with the docs, but... I had to discard it. Most days I don't even have free time to spend with my family, so... what can I really offer? Just a few words to cover some holes in the existing docs in the areas I'm currently working in. I would be really happy to help, but I fear I just can't offer you a really valuable amount of work...

comment:9 Changed 10 years ago by Adam Peller

I just updated the ShrinkSafe docs, which still need a little love:

http://docs.dojocampus.org/shrinksafe/index

If you use StackOverflow, until the community is fully primed on it, you may still have to 'ping' the community for an answer :( I understand how you feel about the mailing lists.

comment:10 in reply to:  8 ; Changed 10 years ago by Eugene Lazutkin

Replying to Arech:

OK, I'll post questions like this on StackOverflow first. I'm sorry to bother you.

The mailing list is carried on the web by both Gmane (http://dir.gmane.org/index.php?prefix=gmane.comp.web.dojo --- dojo.devel is for contributors, dojo.user is most probably what you need) and Nabble (http://www.nabble.com/Dojo-%282%29-f36462.html). They both provide nice forum-like interfaces, and you can read and post right from their UI.

While I am not a fan of forums, I use Gmane as an NNTP server in my Thunderbird to read Dojo (and other) mailing lists as news groups.

About "need help with the docs"

I understand that you are busy just like the rest of us. Nobody expects you to write the "War and Peace" of Dojo docs. But even small hints, gotchas, and clarifications are helpful for other users. Every little bit is appreciated. The best approach is to add them as you discover something --- developers are often set in their ways, so they frequently miss that some things are not obvious. So don't even think twice if you see a typo, an inaccuracy, an omission, or have an idea for a small example.

comment:11 in reply to:  10 ; Changed 10 years ago by Aleksey Rechinskiy

Replying to elazutkin:

Hi, Eugene :) I still haven't gotten my project to an international version, so I can't help with Dojo's translation into Russian, as we discussed a few months ago in the Russian-speaking Dojo Google group. But I still want to do this and am working on it.

Thanks for the links; the newsgroup/forum-like style looks much better than mailing lists.

... So don't even think twice if you see a typo, an inaccuracy, an omission, or have an idea for a small example.

OK, that is what I'll do. BTW: it is rather unclear that when a user wants to register at docs.dojocampus.org, they should go to dojotoolkit.org, register there, and use those credentials to log in to docs.dojocampus.org... I thought a separate registration was required for docs.dojocampus.org, until it finally clicked and I tried to log in with my dojotoolkit login. BTW2: the UserPreferences page (like http://docs.dojocampus.org/welcome?action=userprefs ) doesn't save new data? I get "ImportError" each time I try to save even an unchanged page.

comment:12 in reply to:  11 ; Changed 10 years ago by Eugene Lazutkin

Replying to Arech:

No problem. I think we will sort the dojocampus.org registration kinks out.

BTW, internationalization is one big word we need help with. ;-) The other two big words we need extra help with are: localization and accessibility.

The majority of developers use US English, and it kind of limits us: we don't spot problems obvious to the international audience. It is very important that this hint on UTF-8 is not lost. It would be even better to have a test case for it.

comment:13 in reply to:  12 Changed 10 years ago by Adam Peller

Replying to elazutkin:

The majority of developers use US English, and it kind of limits us: we don't spot problems obvious to the international audience. It is very important that this hint on UTF-8 is not lost. It would be even better to have a test case for it.

Eugene, agreed, we can always use non-US-English testing, but this is a little different because it's an issue with system configuration and tools. I'm guessing that the JDK picks up its defaults from the OS, or perhaps it's a certain distribution of the JVM which assumes some other encoding. Like the timezone problems, it would be difficult to test for. We'd need not just the unit test cases, but all possible system configurations.

Perhaps it would make sense to default Shrinksafe's encoding to UTF-8 rather than use the JVM's setting. I haven't thought through the consequences. That's what we did with our tools, but we control the content. Does JavaScript really define a default encoding? I think in reality, UTF-8 is usually the right thing, but I'm not sure this is mandated by the spec.
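
The approach described for the Dojo build tools (fix the encoding to UTF-8 instead of relying on the JVM setting) can be sketched in Java roughly as follows. This is not ShrinkSafe's actual code: `compress` is a hypothetical stand-in that returns its input unchanged, and the class name is made up for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

class Utf8Pipeline {

    // HYPOTHETICAL placeholder for the real compressor call.
    // It returns the source unchanged so the sketch stays self-contained.
    static String compress(String source) {
        return source;
    }

    // Decode the input as UTF-8, process the Java string, encode as UTF-8.
    // The JVM default charset is never consulted.
    static byte[] transform(byte[] inputBytes) {
        String source = new String(inputBytes, StandardCharsets.UTF_8);
        return compress(source).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Usage: java Utf8Pipeline input.js output.js
        byte[] input = Files.readAllBytes(Paths.get(args[0]));
        Files.write(Paths.get(args[1]), transform(input));
    }
}
```

Because both the decoding and the encoding step name the charset explicitly, the result should not depend on `-Dfile.encoding` or on the OS locale.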

comment:14 Changed 10 years ago by Eugene Lazutkin

According to the ECMA standard (I am looking at the draft (tc39-2009-025.pdf) section 6: Source Text --- it was carried over from the latest active standard):

ECMAScript source text is represented as a sequence of characters in the Unicode character encoding, version 3.0 or later, using the UTF-16 transformation format. The text is expected to have been normalised to Unicode Normalised Form C (canonical composition), as described in Unicode Technical Report # 15. Conforming ECMAScript implementations are not required to perform any normalisation of text, or behave as though they were performing normalisation of text, themselves.

That's how JavaScript works with all characters --- as Unicode encoded in 16-bit units. (A side note: AFAIK UTF-16 was obsoleted by Unicode a long time ago, but we have what we have).

In Real Life (tm), ASCII and UTF-8 should be fine. I am kind of surprised that Rhino does not analyze the BOM (or its absence) to make the right decision about the input.
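
BOM-based detection of the kind mentioned above could look roughly like this in Java. This is a sketch, not a description of what Rhino or ShrinkSafe does: the fallback to UTF-8 when no BOM is present is an assumption, UTF-32 BOMs are not handled, and `BomSniffer` is an illustrative name.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class BomSniffer {

    // Length in bytes of a leading UTF-8 or UTF-16 BOM, or 0 if there is none.
    static int bomLength(byte[] d) {
        if (d.length >= 3
                && (d[0] & 0xFF) == 0xEF && (d[1] & 0xFF) == 0xBB && (d[2] & 0xFF) == 0xBF) {
            return 3; // UTF-8 BOM
        }
        if (d.length >= 2) {
            int b0 = d[0] & 0xFF;
            int b1 = d[1] & 0xFF;
            if ((b0 == 0xFE && b1 == 0xFF) || (b0 == 0xFF && b1 == 0xFE)) {
                return 2; // UTF-16 BOM, big- or little-endian
            }
        }
        return 0;
    }

    // Pick a charset from the BOM.
    // ASSUMPTION: input without a BOM is treated as UTF-8.
    static Charset detect(byte[] d) {
        int n = bomLength(d);
        if (n == 2) {
            return (d[0] & 0xFF) == 0xFE ? StandardCharsets.UTF_16BE : StandardCharsets.UTF_16LE;
        }
        return StandardCharsets.UTF_8;
    }

    // Decode the data with the detected charset, skipping the BOM itself.
    static String decode(byte[] d) {
        int skip = bomLength(d);
        return new String(d, skip, d.length - skip, detect(d));
    }

    public static void main(String[] args) {
        byte[] noBom = {(byte) 0xD0, (byte) 0xAD, (byte) 0xD1, (byte) 0x82, (byte) 0xD0, (byte) 0xBE};
        System.out.println(detect(noBom).name());
        System.out.println(decode(noBom).equals("\u042D\u0442\u043E"));
    }
}
```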
