Opened 11 years ago

Closed 11 years ago

Last modified 9 years ago

#6628 closed enhancement (fixed)

Provide option to preserve ASCII/Unicode escapes in JS

Reported by: ctheiss
Owned by: Adam Peller
Priority: high
Milestone: 1.4.1
Component: ShrinkSafe
Version: 1.1.0
Keywords:
Cc: corey@…
Blocked By:
Blocking:

Description (last modified by Adam Peller)

Provide a new switch on ShrinkSafe to leave \xnn and \unnnn escapes in place, rather than replacing them inline. It's a trade-off between space savings and encoding headaches, and some may care more about the latter.
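To make the trade-off concrete, here is a minimal sketch, assuming Node.js. The helper `escapeNonAscii` is hypothetical and is not ShrinkSafe's implementation; it only shows that keeping the escape costs a few extra bytes but leaves the file pure ASCII, so it survives being served under any ASCII-compatible encoding.

```javascript
// Hypothetical helper, NOT part of ShrinkSafe: rewrite every non-ASCII
// UTF-16 code unit as a \uXXXX escape so the source is pure ASCII.
function escapeNonAscii(source) {
  return source.replace(/[^\x00-\x7f]/g, function (ch) {
    return "\\u" + ch.charCodeAt(0).toString(16).padStart(4, "0");
  });
}

// The escape below is resolved by the JS parser, so `inlined` holds the
// raw per mille character, as it would after the escape is replaced inline.
var inlined = 'var perMille = "\u2030";';
var escaped = escapeNonAscii(inlined); // var perMille = "\u2030"; (escape kept as text)

console.log(Buffer.byteLength(inlined, "utf8")); // 21 (raw character takes 3 bytes)
console.log(Buffer.byteLength(escaped, "utf8")); // 24 (escape takes 6 ASCII bytes)
```

Both strings evaluate to the same value; only the bytes on disk differ.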

Original description below:

This affects many other aspects of Dojo, such as i18n.

Please see the screenshot for an analysis:

  1. The top pane is number.js uncompressed. Notice that 1 and 2 refer to Unicode characters, defined in '\uxxxx' format.
  2. The middle pane is the file on our build machine after compression. The '\uxxxx' escapes have been replaced by 2-4 other characters (interestingly, the second character of #2 is correct).
  3. The bottom pane is how the string is interpreted by Firebug, after going through the web server and being encoded in UTF-8. This appears to be "correct", given what the source looks like on the server.

This analysis leads me to believe that the problem lies in the conversion from step 1 to 2 (i.e. ShrinkSafe), as opposed to a wonky encoding.

A related bug is #5027.

Attachments (1)

encoding.png (20.3 KB) - added by guest 11 years ago.


Change History (13)

Changed 11 years ago by guest

Attachment: encoding.png added

comment:1 Changed 11 years ago by Adam Peller

Reporter: changed from guest to ctheiss

comment:2 Changed 11 years ago by Adam Peller

So I'm still thinking this is just encoding confusion.

If you run java -jar custom_rhino.jar -c dojo/number.js, you'll see that it replaces all the \u and \x escapes with the Unicode characters they denote, as you'd expect, and writes them out as raw Unicode. Our build process instead applies UTF-8 encoding. In my editor, I see \u2030 (per mille) represented as the hex byte sequence E2 80 B0. If I open dijit-all.js in TextEdit with UTF-8 encoding, I see the per mille character.
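The byte values are easy to check. A minimal sketch, assuming Node.js (`utf8Hex` is a hypothetical helper name used only for illustration):

```javascript
// Hypothetical helper: hex dump of a string's UTF-8 encoding.
function utf8Hex(str) {
  return Buffer.from(str, "utf8").toString("hex").toUpperCase();
}

console.log(utf8Hex("\u2030")); // E280B0 (per mille, three bytes)
console.log(utf8Hex("\u00a4")); // C2A4   (two bytes)
```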

You implied you were seeing \u and \x escapes put back in the code. What did you use to view the code? It sounds like somehow the file is being opened with the wrong encoding, likely ISO8859-1. Can you provide a live example?

comment:3 Changed 11 years ago by guest

My understanding of encoding is a little limited, so I tried opening up the resultant file in my editor (TextPad) in three different ways: ANSI (which I think is analogous to ISO8859-1), UTF-8, and binary. Here are the results for the line in _applyPattern with the per mille (\u2030) and star (\u00a4, the generic currency sign):

  1. ANSI: Weirdness from screenshot
  2. UTF-8: Encoded as expected: "<per mille symbol>" and "<star>"
  3. Binary: Per mille translates to 22 E2 80 B0 22, star translates to 22 C2 A4 22 (I included the quotes around the strings)
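The three views can be reproduced from the same bytes. This is a rough sketch, assuming Node.js and assuming that "ANSI" behaves like ISO8859-1 (on Windows it is usually Windows-1252, which differs for bytes 0x80 to 0x9F, so the exact junk characters on screen may vary):

```javascript
// The quoted per mille string as the build writes it: UTF-8 on disk.
var onDisk = Buffer.from('"\u2030"', "utf8");

// Binary view: space-separated hex bytes.
var hex = Array.from(onDisk, function (b) {
  return b.toString(16).toUpperCase().padStart(2, "0");
}).join(" ");
console.log(hex); // 22 E2 80 B0 22

// UTF-8 view: the bytes decode back to one character between the quotes.
var asUtf8 = onDisk.toString("utf8");
console.log(asUtf8.length); // 3

// ISO8859-1 view: every byte becomes its own character, hence the weirdness.
var asLatin1 = onDisk.toString("latin1");
console.log(asLatin1.length); // 5
```

Nothing in the file changes between the views; only the decoder does.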

So, at this point, I must defer to whoever understands encoding the best. I know for sure that we are serving up ISO8859-1 over Tomcat, a decision probably made in order to match our database's encoding, which we unfortunately don't have a lot of control over.

Thanks for all your help! I will go get some coffee now and learn UTF-8 :)

comment:4 Changed 11 years ago by Adam Peller

Description: modified (diff)
Summary: changed from Shrinksafe changes unicode characters to Provide option to preserve ASCII/Unicode escapes in JS

Right. So I think the build is working properly, and you are required to serve up Dojo content with UTF-8 or strange things like this will happen. FWIW, you ought to be able to serve up different pages in Tomcat with different encodings. I know this is just another server configuration headache.
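For illustration only, here is what "serving with UTF-8" amounts to at the HTTP level. This sketch uses Node.js rather than Tomcat, and `serveScript` is a hypothetical handler name; the point is just that the bytes sent and the charset declared in Content-Type must agree.

```javascript
var http = require("http");

// Hypothetical handler: encode the script as UTF-8 and declare that
// encoding in the Content-Type header so the browser decodes it correctly.
function serveScript(body, res) {
  var payload = Buffer.from(body, "utf8");
  res.writeHead(200, {
    "Content-Type": "application/javascript; charset=utf-8",
    "Content-Length": payload.length
  });
  res.end(payload);
}

var server = http.createServer(function (req, res) {
  serveScript('var perMille = "\u2030";', res);
});
// server.listen(8080); // uncomment to try it locally (port is arbitrary)
```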

Rather than close this ticket out, I'm going to suggest we repurpose it to request that ShrinkSafe have an option to turn off this behavior and leave the escapes in place. Treatment of \x00 might be independent of this. Whether the build system continues to assume UTF-8 would be a separate request, but I think it was a design decision to keep things simple by reducing options and assuming a single encoding.

comment:5 Changed 11 years ago by Adam Peller

Type: changed from defect to enhancement

comment:6 Changed 11 years ago by Adam Peller

Owner: changed from alex to Adam Peller
Status: changed from new to assigned

comment:7 Changed 11 years ago by ctheiss

We have recently converted our database and web server to UTF-8, and the problem went away! So, it looks like your analysis was correct, peller :)

comment:8 Changed 11 years ago by Adam Peller

Milestone: changed from 1.2 to 1.3

UTF-8 is your friend.

Anyhow, I'd rather do this after #7127 is done.

comment:9 Changed 11 years ago by James Burke

Resolution: fixed
Status: changed from assigned to closed

SVN/Trac integration broken? Fixed in [15529], by the patch in ticket #7913.

comment:10 Changed 9 years ago by Adam Peller

(In [21191]) -escape-unicode arg was not being passed along. Add unit test. Fixes #10645, Refs #6628

comment:11 Changed 9 years ago by Adam Peller


comment:12 Changed 9 years ago by Adam Peller

(In [21230]) Rebuilt shrinksafe. Refs #10645, #6628
