Opened 10 years ago

Closed 10 years ago

Last modified 10 years ago

#9886 closed defect (fixed)

Improve the HTML processing function of Editor

Reported by: Jared Jurkiewicz
Owned by:
Priority: high
Milestone: 1.4
Component: Editor
Version: 1.3.2
Keywords:
Cc: bill, liucougar
Blocked By:
Blocking:

Description

I was debugging a problem reported by a user of Editor: it was mangling <script> tags contained within editor content. I informed the user that putting script tags into editor content is extremely dangerous and opens the door to XSS attacks.

Well, in some cases they want to allow it anyway, as they have server-side content filtering to strip out scripts that worry them. While I still think it's a bad idea, the investigation led me to:

dijit/_editor/html.js

I took a look at '_getNodeHtml' and realized it could be improved in general: not just to fix the script corruption problem, but also to remove the IE-specific code branch and use common processing for all browsers.

The cleanup actually comes from things I learned while implementing dojox.html.format.prettyPrint for Dojo 1.4. I figured out a clean, common way to get the outer HTML wrapper of a node (with just its user-set attributes), without having to special-case browsers or worry about the content of its child nodes.
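
As a rough sketch of that idea (the helper name here is made up; the real change is in the attached dijit._editor.html.patch):

	// Serialize just the outer tag of a node, with its user-set attributes,
	// using the same code path on every browser.
	function getOuterTag(/*DomNode*/ node){
		var clone = node.cloneNode(false);             // shallow clone: no children copied
		var div = node.ownerDocument.createElement("div");
		div.appendChild(clone);                        // scratch wrapper used only for serialization
		// innerHTML of the wrapper yields e.g. '<div class="myClass" style="..."></div>'
		// with none of the original node's content to strip back out.
		return div.innerHTML;
	}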

Anyway, I've thrown together a prototype cleanup of that function that uses the common node-processing approach (and I fixed the script handling issue while I was at it).

There are other node types that some browsers handle badly (<pre> in Firefox, for instance; see the prettyPrint function for what it had to do) that I didn't include fixups for (but I could).

Anyway, what do people think of these cleanups? It definitely improved some things for me when returning HTML content.

Patch coming.

Attachments (3)

dijit._editor.html.patch (4.4 KB) - added by Jared Jurkiewicz 10 years ago.
Patch for some cleanup/commonization of node processing.
performance.zip (15.8 KB) - added by Jared Jurkiewicz 10 years ago.
The tests I used to gen those numbers. Extract under dijit/tests/editor
scripthandling.patch (1.4 KB) - added by Jared Jurkiewicz 10 years ago.
Patch with just the script handling fixed. (Node parsing is left alone)


Change History (12)

Changed 10 years ago by Jared Jurkiewicz

Attachment: dijit._editor.html.patch added

Patch for some cleanup/commonization of node processing.

comment:1 Changed 10 years ago by liucougar

a div is created for each node, plus an extra cloneNode, so for big documents this will make serializing slower

the outerHTML call in there is to gain some performance for slow IE.

if you do a benchmark of the new approach and the current approach, I expect that the latter is much faster.

comment:2 Changed 10 years ago by Jared Jurkiewicz

I'm not so sure about that. Yes, calling outerHTML may be a little faster at getting the node text ... but consider that outerHTML also returns all the content inside the DOM node, whereas my approach does not (it does not clone deep), so you obtain:

<div class="myClass" style="font-weight: bold; font-style: italic; width: 100%; height: 20px; display: none;"></div>

Instead of, say:

<DIV class=myClass style="DISPLAY: none; FONT-WEIGHT: bold; WIDTH: 100%; FONT-STYLE: italic; HEIGHT: 20px">This is some random <B>text inside the dom</B> it is intended to <I>help</I> estimate performance of this operation.<SPAN>so ...</SPAN> whee. STOP!</DIV>

So for large documents, your outerHTML calls are going to return *huge* strings.

And then further into the code you perform:

s = s.substr(0, s.indexOf('>'))
    .replace(/(['"])[^'"]*\1/g, ''); //to make the following regexp safe
var reg = /(\b\w+)\s?=/g;
var m, key;
while((m = reg.exec(s))){

So you end up running a regular expression over a far longer string, which will take a lot more time than if you just had the container node's info and none of its contents to parse. So for large documents, I suspect it'll be a wash, or maybe slightly faster, to do it my way.

Anyway, I'm going to see about testing that to be sure.
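
For context, a minimal sketch of the trial/iteration structure behind the numbers that follow (illustrative only; the actual harness is in the attached performance.zip, and the function name here is made up):

	// Run `trials` trials of `iterations` calls each; the stats below
	// (average trial time, per-iteration average, variance, etc.) are
	// computed from the per-trial totals this returns.
	function runTrials(fn, iterations, trials){
		var times = [];
		for(var t = 0; t < trials; t++){
			var start = new Date().getTime();
			for(var i = 0; i < iterations; i++){ fn(); }
			times.push(new Date().getTime() - start); // one trial's total, in ms
		}
		return times;
	}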

Initial results (just testing outerHTML) show:

TEST: outerHTML_impl
TRIAL SIZE: 320 iterations
NUMBER OF TRIALS: 50
AVERAGE TRIAL EXECUTION TIME: 98.8600000000ms.
MAXIMUM TEST ITERATION TIME: 0.3906250000ms.
MINIMUM TEST ITERATION TIME: 0.1468750000ms.
AVERAGE TEST ITERATION TIME: 0.3089375000ms.
MEDIAN TEST ITERATION TIME: 0.3406250000ms.
VARIANCE TEST ITERATION TIME: 0.0053445352ms.
STANDARD DEVIATION ON TEST ITERATION TIME: 0.0731063277ms.


TEST: outerHTML_Native
TRIAL SIZE: 1280 iterations
NUMBER OF TRIALS: 50
AVERAGE TRIAL EXECUTION TIME: 147.1800000000ms.
MAXIMUM TEST ITERATION TIME: 0.1710937500ms.
MINIMUM TEST ITERATION TIME: 0.0976562500ms.
AVERAGE TEST ITERATION TIME: 0.1149843750ms.
MEDIAN TEST ITERATION TIME: 0.1093750000ms.
VARIANCE TEST ITERATION TIME: 0.0004482590ms.
STANDARD DEVIATION ON TEST ITERATION TIME: 0.0211721287ms.

This shows that using outerHTML on IE is roughly 3x faster at getting the node ... but it's the other processing you do that I think will negate that.

I'll try to set up a test with a large document to verify and see if I can get a rough number on how much of a difference it is overall.

(Note that none of the times are particularly slow; they're all sub-ms.)

comment:3 Changed 10 years ago by Jared Jurkiewicz

Here are some quick and dirty results using lorem ipsum documents (<p> tags with a LOT of text).

smalldoc is a lorem ipsum doc of about 4KB
largedoc is a lorem ipsum doc of about 41KB

Firefox 2 (old way):


TEST: smallDoc
TRIAL SIZE: 40 iterations
NUMBER OF TRIALS: 50
AVERAGE TRIAL EXECUTION TIME: 89.3600000000ms.
MAXIMUM TEST ITERATION TIME: 7.0250000000ms.
MINIMUM TEST ITERATION TIME: 1.1500000000ms.
AVERAGE TEST ITERATION TIME: 2.2340000000ms.
MEDIAN TEST ITERATION TIME: 1.9500000000ms.
VARIANCE TEST ITERATION TIME: 1.9510440000ms.
STANDARD DEVIATION ON TEST ITERATION TIME: 1.3967977663ms.

TEST: largeDoc
TRIAL SIZE: 10 iterations
NUMBER OF TRIALS: 50
AVERAGE TRIAL EXECUTION TIME: 146.0600000000ms.
MAXIMUM TEST ITERATION TIME: 28.2000000000ms.
MINIMUM TEST ITERATION TIME: 10.9000000000ms.
AVERAGE TEST ITERATION TIME: 14.6060000000ms.
MEDIAN TEST ITERATION TIME: 12.5000000000ms.
VARIANCE TEST ITERATION TIME: 38.7761640000ms.
STANDARD DEVIATION ON TEST ITERATION TIME: 6.2270509874ms.

Firefox 2 (new way):


TEST: smallDoc
TRIAL SIZE: 40 iterations
NUMBER OF TRIALS: 50
AVERAGE TRIAL EXECUTION TIME: 156.2600000000ms.
MAXIMUM TEST ITERATION TIME: 25.0000000000ms.
MINIMUM TEST ITERATION TIME: 1.5500000000ms.
AVERAGE TEST ITERATION TIME: 3.9065000000ms.
MEDIAN TEST ITERATION TIME: 1.9500000000ms.
VARIANCE TEST ITERATION TIME: 23.0504702500ms.
STANDARD DEVIATION ON TEST ITERATION TIME: 4.8010905272ms.

TEST: largeDoc
TRIAL SIZE: 10 iterations
NUMBER OF TRIALS: 50
AVERAGE TRIAL EXECUTION TIME: 215.9000000000ms.
MAXIMUM TEST ITERATION TIME: 37.5000000000ms.
MINIMUM TEST ITERATION TIME: 15.6000000000ms.
AVERAGE TEST ITERATION TIME: 21.5900000000ms.
MEDIAN TEST ITERATION TIME: 17.2000000000ms.
VARIANCE TEST ITERATION TIME: 65.8621000000ms.
STANDARD DEVIATION ON TEST ITERATION TIME: 8.1155468084ms.


So for Firefox 2 it does have some impact, which isn't surprising given that FF2 wasn't doing anything with outerHTML before. On average it's about 30% slower, but we're still talking small times here; even the ~40KB doc parsed and returned in about 20ms.

Now ... IE (7):


IE7 old way:
TEST: smallDoc
TRIAL SIZE: 40 iterations
NUMBER OF TRIALS: 50
AVERAGE TRIAL EXECUTION TIME: 125.0000000000ms.
MAXIMUM TEST ITERATION TIME: 3.1250000000ms.
MINIMUM TEST ITERATION TIME: 3.1250000000ms.
AVERAGE TEST ITERATION TIME: 3.1250000000ms.
MEDIAN TEST ITERATION TIME: 3.1250000000ms.
VARIANCE TEST ITERATION TIME: 0.0000000000ms.
STANDARD DEVIATION ON TEST ITERATION TIME: 0.0000000000ms.

TEST: largeDoc
TRIAL SIZE: 5 iterations
NUMBER OF TRIALS: 50
AVERAGE TRIAL EXECUTION TIME: 87.8600000000ms.
MAXIMUM TEST ITERATION TIME: 34.4000000000ms.
MINIMUM TEST ITERATION TIME: 15.6000000000ms.
AVERAGE TEST ITERATION TIME: 17.5720000000ms.
MEDIAN TEST ITERATION TIME: 15.6000000000ms.
VARIANCE TEST ITERATION TIME: 22.2912160000ms.
STANDARD DEVIATION ON TEST ITERATION TIME: 4.7213574319ms.


IE7 New way:


TEST: smallDoc
TRIAL SIZE: 40 iterations
NUMBER OF TRIALS: 50
AVERAGE TRIAL EXECUTION TIME: 118.1000000000ms.
MAXIMUM TEST ITERATION TIME: 3.1250000000ms.
MINIMUM TEST ITERATION TIME: 1.5500000000ms.
AVERAGE TEST ITERATION TIME: 2.9525000000ms.
MEDIAN TEST ITERATION TIME: 3.1250000000ms.
VARIANCE TEST ITERATION TIME: 0.2047312500ms.
STANDARD DEVIATION ON TEST ITERATION TIME: 0.4524723748ms.

TEST: largeDoc
TRIAL SIZE: 5 iterations
NUMBER OF TRIALS: 50
AVERAGE TRIAL EXECUTION TIME: 105.6600000000ms.
MAXIMUM TEST ITERATION TIME: 34.4000000000ms.
MINIMUM TEST ITERATION TIME: 15.6000000000ms.
AVERAGE TEST ITERATION TIME: 21.1320000000ms.
MEDIAN TEST ITERATION TIME: 18.8000000000ms.
VARIANCE TEST ITERATION TIME: 24.5657760000ms.
STANDARD DEVIATION ON TEST ITERATION TIME: 4.9563873941ms.

So ... looking at the averages:

Small:
OLD: AVERAGE TEST ITERATION TIME: 3.1250000000ms.
NEW: AVERAGE TEST ITERATION TIME: 2.9525000000ms.

Large:
OLD: AVERAGE TEST ITERATION TIME: 17.5720000000ms.
NEW: AVERAGE TEST ITERATION TIME: 21.1320000000ms.


Not a huge difference, really; it washes out on IE 7. The cost of running the regexp over all the extra junk in the string (which you avoid with the clone approach) ends up offsetting the extra cost of the cloneNode.

So, eh.

The times aren't hugely different in the end (everything is still fast), and it makes the code more consistent across browsers in how tags are handled. Still, I'm not that caught up in having to change that parsing. I was mostly noting that commonizing the handling might be nice to do (one code path versus two), and that we'd likely get more consistent results that way. But really, it's not that big a thing.

The issue with it mishandling scripts, though, is very real and something we should probably fix. The fix for that is just the part of the patch that does:

			if(lName === "script"){
				// Browsers handle script tags differently in how you get content, 
				// but innerHTML always seems to work, so insert its content that way
				// Yes, it's bad to allow script tags in the editor code, but some people
				// seem to want to do it.
				output += '>' + node.innerHTML +'</' + lName + '>';
			}else{
				if(node.childNodes.length){
					output += '>' + dijit._editor.getChildrenHtml(node)+'</'+node.nodeName.toLowerCase()+'>';
				}else{
					output += ' />';
				}
			}
			break;
		case 4: // cdata 
		case 3: // text
			// FIXME:
			output = dijit._editor.escapeXml(node.nodeValue, true);
			break;

That does resolve the inner contents of scripts being mangled by the RTE parse/gen of HTML. Would you be okay with me fixing at least that issue?

And would either of you know why in the world IE simply destroys a script tag if it's the first line/tag of the content? For example:

editor.setValue("<script src="foo.js" type="javascript"></script>"); console.log(editor.getValue()); returns empty.

Whereas:

editor.setValue('<br><script src="foo.js" type="javascript"></script> ');
console.log(editor.getValue()); // returns <br /><script src="foo.js" type="javascript"></script>

Any ideas?

Changed 10 years ago by Jared Jurkiewicz

Attachment: performance.zip added

The tests I used to gen those numbers. Extract under dijit/tests/editor

Changed 10 years ago by Jared Jurkiewicz

Attachment: scripthandling.patch added

Patch with just the script handling fixed. (Node parsing is left alone)

comment:4 Changed 10 years ago by liucougar

the current way does not actually run the replace over all that junk: notice the substr call below; it extracts the first opening tag only

s = s.substr(0, s.indexOf('>'))
    .replace(/("])[^"?*\1/g, ); //to make the following regexp safe

the script change is fine, but I don't really see why it should be part of the default editor; maybe this should be added in a plugin? (By default, I think the editor should remove all script tags.)

IE has some weird behavior with script tags, so I think you are just encountering IE bugs.

comment:5 Changed 10 years ago by Jared Jurkiewicz

Milestone: tbd → 1.4

Fixing just the script issue. It has to be fixed here, since this code runs before any plugin ever does anything, so if scripts can be in the content (and some people do want that, as crazy as it is), they have to be serialized out correctly right here. Other plugins can still strip them, though; for example, the ViewSource one does, and it won't allow you to include scripts by default.

comment:6 Changed 10 years ago by Jared Jurkiewicz

Resolution: fixed
Status: new → closed

(In [20104]) Fixing script tag content handling. fixes #9886

comment:7 Changed 10 years ago by bill

I agree the performance isn't an issue since it's fast (and about the same speed) either way. The code size is about the same too.

"It definitely improved some things for me when returning HTML content."

Can you be more specific (ie, give an example)? If this fixes something we should incorporate it.

"Editor that it was mangling <script> tags contained within editor content"

Can you give an example?

I'm not really interested in dijit's Editor supporting script tags, despite the fact that some user asked for it. It creates a testing and support burden for us as well as the perception of a security problem (despite the technical reality that true security needs server-side filtering).

Actually I'm confused about what "supporting scripts" even means. Seems like the only way they could show up would be through the "View Source" plugin?

comment:8 Changed 10 years ago by Jared Jurkiewicz

(In [20225]) Committing a minor tweak to improve xhtml-style tag closing and unknown tag closing. Found problems on FF when using /> closures on unknown tags. refs #9886

comment:9 Changed 10 years ago by bill

Talked to Jared about the mangled <script> tags. The issue was with serializing the editor's content (i.e., getting the value): FF/Safari were converting script nodes like <script> ... " ... </script> into <script> ... &quot; ... </script>, and IE was just dropping script nodes.
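
For the record, roughly what that looked like (content made up for illustration; the behavior is as described above):

	editor.setValue('<br><script type="javascript">var s = "x";</script>');
	console.log(editor.getValue());
	// Before the fix, FF/Safari returned the script body re-escaped:
	//   <br /><script type="javascript">var s = &quot;x&quot;;</script>
	// With [20104] reading node.innerHTML for script elements, the quotes
	// survive the round trip; IE dropping leading script nodes is a separate
	// browser quirk (see comment 3).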

I'm uneasy about having any support for script tags in editor just because it implies a commitment to support them down the road, but I guess we'll leave it this way for now, just without any official statement that we are supporting them.
