#6628 closed enhancement (fixed)
Provide option to preserve ASCII/Unicode escapes in JS
Reported by: | ctheiss | Owned by: | Adam Peller |
---|---|---|---|
Priority: | high | Milestone: | 1.4.1 |
Component: | ShrinkSafe | Version: | 1.1.0 |
Keywords: | Cc: | [email protected]… | |
Blocked By: | Blocking: |
Description (last modified by )
Provide a new switch on ShrinkSafe to leave \xnn and \unnnn escapes in place, rather than replacing them inline with the characters they represent. It's a trade-off between space savings and encoding headaches, and some may care more about the latter.
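To make the requested behavior concrete, here is a minimal sketch of what "leaving escapes in place" amounts to for the output: every character outside printable ASCII is written as an escape again, so the file is pure ASCII and reads the same under ISO-8859-1 or UTF-8. The helper name `escapeNonAscii` is hypothetical and this is not how ShrinkSafe implements anything; it also assumes the input is the contents of a string literal, not arbitrary source.

```javascript
// Hypothetical helper, NOT part of ShrinkSafe: re-escape every UTF-16 code
// unit outside printable ASCII (tab, CR and LF are left alone).
function escapeNonAscii(text) {
  return text.replace(/[^\x20-\x7e\n\r\t]/g, function (ch) {
    var code = ch.charCodeAt(0);
    var hex = code.toString(16);
    if (code <= 0xff) {
      // Fits in one byte: use the shorter \xnn form.
      return "\\x" + ("00" + hex).slice(-2);
    }
    // Everything else: \unnnn (surrogate pairs become two escapes).
    return "\\u" + ("0000" + hex).slice(-4);
  });
}

// The string below contains the real per mille and currency characters,
// which is what the compressed file holds after escapes are inlined.
var inlined = "per mille: \u2030, currency: \u00a4";
console.log(escapeNonAscii(inlined));
// prints: per mille: \u2030, currency: \xa4
```

The escaped form costs a few extra bytes per character, which is the space side of the trade-off described above.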
Original description below:
This affects many other aspects of dojo, like I18n.
Please see the screenshot for an analysis:
- The top pane is number.js uncompressed. Notice that #1 and #2 refer to Unicode characters, defined in '\uxxxx' format.
- The middle pane is the file on our build machine after compression. The '\uxxxx' have been replaced by 2-4 other characters (interestingly, the second character of #2 is correct).
- The bottom pane is how the string is interpreted by Firebug, after going through the web server and being encoded in UTF-8. This appears to be "correct", given what the source looks like on the server.
This analysis leads me to believe that the problem lies in the conversion from step 1 to 2 (i.e. ShrinkSafe), as opposed to a wonky encoding.
A related bug is #5027.
Attachments (1)
Change History (13)
Changed 14 years ago by
Attachment: | encoding.png added |
---|
comment:1 Changed 14 years ago by
Reporter: | changed from guest to ctheiss |
---|
comment:2 Changed 14 years ago by
So I'm still thinking this is just encoding confusion.
If you run java -jar custom_rhino.jar -c dojo/number.js, you'll see that it replaces all the \u and \x escapes with the Unicode characters they represent, as you'd expect, and writes them out as raw Unicode. Our build process instead applies UTF-8 encoding. In my editor, I see \u2030 (per mille) represented as the hex byte sequence E2 80 B0. If I open dijit-all.js in TextEdit with UTF-8 encoding, I see the per mille character.
You implied you were seeing \u and \x escapes put back in the code. What did you use to view the code? It sounds like somehow the file is being opened with the wrong encoding, likely ISO8859-1. Can you provide a live example?
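The byte sequence mentioned above can be checked independently of the build. This is a small sketch assuming Node.js (`Buffer` is a Node API and does not exist in browsers); `utf8Hex` is a name made up for this example.

```javascript
// Return the UTF-8 encoding of a string as uppercase hex.
// Assumes Node.js; utf8Hex is a hypothetical helper for illustration.
function utf8Hex(text) {
  return Buffer.from(text, "utf8").toString("hex").toUpperCase();
}

console.log(utf8Hex("\u2030")); // E280B0 (per mille sign, three bytes)
console.log(utf8Hex("\u00a4")); // C2A4   (U+00A4, two bytes)
```

A single \uxxxx escape therefore becomes two or three bytes on disk, which is why the file only looks right when it is read back as UTF-8.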
comment:3 Changed 14 years ago by
My understanding of encoding is a little limited, so I tried opening up the resultant file in my editor (TextPad) in three different ways: ANSI (which I think is analogous to ISO8859-1), UTF-8, and binary. Here are the results of the line in _applyPattern with the per mille (\u2030) and currency sign (\u00a4):
- ANSI: Weirdness from screenshot
- UTF-8: Encoded as expected: "<per mille symbol>" and "<currency sign>"
- Binary: Per mille translates to 22 E2 80 B0 22, currency sign translates to 22 C2 A4 22 (I included the quotes around the strings)
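The three views above can be reproduced from the binary bytes alone. This is a sketch assuming Node.js; `viewAs` is a hypothetical helper, and Node's "latin1" is strict ISO-8859-1, whereas an editor's "ANSI" mode is usually Windows-1252, so the exact glyphs in TextPad may differ slightly.

```javascript
// Decode the same raw bytes under two different encodings.
// Assumes Node.js; viewAs is a hypothetical helper for illustration.
function viewAs(bytes, encoding) {
  return Buffer.from(bytes).toString(encoding);
}

// Bytes from the binary view, without the surrounding quote bytes (0x22).
var perMilleBytes = [0xe2, 0x80, 0xb0];
var secondBytes = [0xc2, 0xa4];

// Read as UTF-8, each sequence is a single character.
console.log(viewAs(perMilleBytes, "utf8").length); // 1
console.log(viewAs(secondBytes, "utf8").length);   // 1

// Read as ISO-8859-1, every byte becomes its own character.
console.log(viewAs(perMilleBytes, "latin1").length); // 3
console.log(viewAs(secondBytes, "latin1").length);   // 2

// The trailing byte 0xA4 on its own is U+00A4 in ISO-8859-1.
console.log(viewAs(secondBytes, "latin1") === "\u00c2\u00a4"); // true
```

That last line may explain why the second character of #2 looked correct in the screenshot: the final UTF-8 byte of U+00A4 happens to be 0xA4, which ISO-8859-1 maps back to the same character.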
So, at this point, I must defer to whoever understands encoding the best. I know for sure that we are serving up ISO8859-1 from Tomcat, a decision probably made in order to match our database's encoding, which we unfortunately don't have a lot of control over.
Thanks for all your help! I will go get some coffee now and learn UTF-8 :)
comment:4 Changed 14 years ago by
Description: | modified (diff) |
---|---|
Milestone: | 1.1.1 → 1.2 |
Summary: | Shrinksafe changes unicode characters → Provide option to preserve ASCII/Unicode escapes in JS |
Right. So I think the build is working properly, and you are required to serve up Dojo content with UTF-8 or strange things like this will happen. FWIW, you ought to be able to serve up different pages in Tomcat with different encodings. I know this is just another server configuration headache.
Rather than close this ticket out, I'm going to repurpose it to suggest that ShrinkSafe offer an option to turn off this behavior and leave the escapes in place. Treatment of \x00 might be independent of this. Whether the build system continues to assume UTF-8 would be a separate request, but I think it was a design decision to keep things simple by reducing options and assuming a single encoding.
comment:5 Changed 14 years ago by
Type: | defect → enhancement |
---|
comment:6 Changed 14 years ago by
Owner: | changed from alex to Adam Peller |
---|---|
Status: | new → assigned |
comment:7 Changed 14 years ago by
We have recently converted our database and webserver to UTF-8, and the problem went away! So, it looks like your analysis was correct, peller :)
comment:8 Changed 14 years ago by
Milestone: | 1.2 → 1.3 |
---|
UTF-8 is your friend.
Anyhow, I'd rather do this after #7127 is done.
comment:9 Changed 14 years ago by
Resolution: | → fixed |
---|---|
Status: | assigned → closed |
comment:10 Changed 12 years ago by
comment:11 Changed 12 years ago by
Milestone: | 1.3 → 1.4.1 |
---|
yup. we need to look no further than our own build:
http://download.dojotoolkit.org/release-1.1.0/dojo-release-1.1.0/dijit/dijit-all.js