Opened 15 years ago

Closed 15 years ago

#4395 closed New Feature (fixed)

Use htmldataprocessor to refactor pasting processor

Reported by: Garry Yao Owned by: Garry Yao
Priority: Normal Milestone: CKEditor 3.1
Component: General Version:
Keywords: Paste Confirmed Cc:

Description (last modified by Garry Yao)

We should start using htmldataprocessor to process pasted input instead of the current implementation, which is based exclusively on regular expressions. Such an infrastructure would bring several benefits:

  1. Structural transformations of the source become easy, instead of simple cleanup only, e.g. MS Word middot bullets -> HTML unordered list;
  2. It leverages all the rules we already have for output, e.g. flash objects, namespaced tags;
  3. It will be much easier for developers to extend/customize by adding/altering the rules.

Change History (5)

comment:1 Changed 15 years ago by Garry Yao

Description: modified (diff)
Status: changed from new to assigned
Summary: changed from "Use htmldataprocessor to refactor pasting clean up" to "Use htmldataprocessor to refactor pasting processor"

comment:2 Changed 15 years ago by Garry Yao

Keywords: Paste added

Changes committed with [4207] in pasting branch.

comment:3 Changed 15 years ago by Garry Yao

Migrated all the regexp-based rules in the 'cleanWord' function to filter rules with [4208].

comment:4 Changed 15 years ago by Garry Yao

There is one significant impedance mismatch between the old regexp-based approach and the current filter-based one:
The old approach is linear, multiple-pass parsing, while our HTML filter is a top-down, one-pass procedure, which makes some of the rules difficult to migrate.

Consider the following example, which should be cleaned up to a single &nbsp;.

	<span lang=EN-GB style='font-family:Calibri'>
		<o:p> &nbsp;</o:p>
	</span>


The old rules related to this were:

html = html.replace(/<o:p>\s*<\/o:p>/g, '') ;
html = html.replace(/<o:p>[\s\S]*?<\/o:p>/g, '&nbsp;') ;
html = html.replace( /<SPAN\s*[^>]*>\s*&nbsp;\s*<\/SPAN>/gi, '&nbsp;' ) ;
html = html.replace( /<SPAN\s*[^>]*><\/SPAN>/gi, '' ) ;
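To make the multiple-pass behaviour concrete, here is a minimal sketch that runs the four rules above over the sample markup, once in their original order and once with the two span rules moved first. The helper `applyRules` and the variable names are invented for illustration; only the regular expressions come from the old code.

```javascript
// The four Word-cleanup rules quoted above, in their original order.
var oldRules = [
	[ /<o:p>\s*<\/o:p>/g, '' ],
	[ /<o:p>[\s\S]*?<\/o:p>/g, '&nbsp;' ],
	[ /<SPAN\s*[^>]*>\s*&nbsp;\s*<\/SPAN>/gi, '&nbsp;' ],
	[ /<SPAN\s*[^>]*><\/SPAN>/gi, '' ]
];

// Hypothetical helper: every rule is one full pass over the whole string.
function applyRules( html, rules )
{
	for ( var i = 0; i < rules.length; i++ )
		html = html.replace( rules[ i ][ 0 ], rules[ i ][ 1 ] );
	return html;
}

var sample = "<span lang=EN-GB style='font-family:Calibri'>\n\t\t<o:p> &nbsp;</o:p>\n\t</span>";

// Original order: <o:p> is rewritten first, so the span rule then sees "&nbsp;".
var inOrder = applyRules( sample, oldRules );

// Span rules first: the span still wraps <o:p> when its rules run, so it survives.
var spanFirst = applyRules( sample,
	[ oldRules[ 2 ], oldRules[ 3 ], oldRules[ 0 ], oldRules[ 1 ] ] );

console.log( inOrder === '&nbsp;' );              // true
console.log( spanFirst.indexOf( '<span' ) === 0 ); // true
```

The second run is roughly what a top-down, one-pass filter does: the outer element is judged before its children have been cleaned.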

Ideally the new rules would be the following, but they are actually wrong because the 'span' rule is always executed first (determined by tree order):

elements :
{
	$ : function( element )
	{
		var tagName = element.name;

		if ( tagName == 'span' )
		{
			var child;
			// Match a span whose only child consists of whitespace and/or &nbsp;.
			if ( ( child = onlyChildOf( element ) )
				 && /^(?:\s|&nbsp;)+$/.test( child.value ) )
				...Drop this element, preserve children...
		}
		else if ( tagName == 'o:p' )
		{
			...Drop this element, preserve children...
		}
	}
}

In such a case the filter needs a mechanism to filter from bottom to top (allowing children to be filtered before the element itself); in this concrete example it would execute the <o:p> rule first, then the <span> rule.

I'm adding a function, CKEDITOR.htmlParser.element::filterChildren, to allow this when necessary, as in the following. The changes were checked in to the pasting branch with [4218].

if ( tagName == 'span' )
{
	// Filter the children first.
	element.filterChildren();

	var child;
	// Match a span whose only child consists of whitespace and/or &nbsp;.
	if ( ( child = onlyChildOf( element ) )
		 && /^(?:\s|&nbsp;)+$/.test( child.value ) )
		...Drop this element, preserve children...
}
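The difference between the two traversal orders can be sketched on a tiny stand-in tree. Everything here (`TextNode`, `ElementNode`, `filterNode`, `shouldUnwrap`, `buildSample`) is hypothetical and is not the CKEditor parser API. The sample tree also assumes the whitespace around <o:p> was not kept as separate text nodes.

```javascript
// Hypothetical stand-ins for parsed nodes; these are not the CKEditor classes.
function TextNode( value ) { this.value = value; }
function ElementNode( name, children ) { this.name = name; this.children = children || []; }

function isBlankText( node )
{
	return node instanceof TextNode && /^(?:\s|&nbsp;)+$/.test( node.value );
}

// The two rules from the ticket: always unwrap <o:p>; unwrap a <span>
// whose only child is blank text.
function shouldUnwrap( node )
{
	if ( node.name == 'o:p' )
		return true;
	return node.name == 'span' && node.children.length == 1
		&& isBlankText( node.children[ 0 ] );
}

function filterList( nodes, bottomUp )
{
	var out = [];
	for ( var i = 0; i < nodes.length; i++ )
		out = out.concat( filterNode( nodes[ i ], bottomUp ) );
	return out;
}

// Returns the nodes that replace `node` in its parent.
// bottomUp == true mimics filtering the children before the element's own rule.
function filterNode( node, bottomUp )
{
	if ( node instanceof TextNode )
		return [ node ];
	if ( bottomUp )
		node.children = filterList( node.children, true );
	var unwrap = shouldUnwrap( node );
	if ( !bottomUp )
		node.children = filterList( node.children, false );
	return unwrap ? node.children : [ node ];
}

function buildSample()
{
	return new ElementNode( 'span',
		[ new ElementNode( 'o:p', [ new TextNode( ' &nbsp;' ) ] ) ] );
}

var topDown = filterNode( buildSample(), false );
var bottomUp = filterNode( buildSample(), true );

console.log( topDown[ 0 ].name );   // span
console.log( bottomUp[ 0 ].value ); // " &nbsp;"
```

In the top-down run the span rule sees an <o:p> child and leaves the span in place; in the bottom-up run the <o:p> has already been unwrapped, so the span rule matches.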

comment:5 Changed 15 years ago by Garry Yao

Resolution: fixed
Status: changed from assigned to closed