Ticket #5393 (new Feature Request)

Opened 6 years ago

Last modified 3 years ago

Use htmltext_lexicon instead of plone_lexicon?

Reported by: thowe Owned by: hazmat
Priority: minor Milestone: Future
Component: Infrastructure Keywords: PortalTransforms html_to_text catalog index splitting search
Cc: tw_switzerland@…

Description

I found that some of our texts were not searchable in Plone because certain keywords were simply missing from the catalog.

Inspecting the catalog showed that words separated only by certain HTML tags were indexed in concatenated form. Tracing the error led me to the "html_to_text.py" transform file inside the PortalTransforms product.

In this file, HTML text is not converted correctly into text/plain. Some examples:

"word1<br>word2"  -->  "word1word2" (should be "word1\nword2")
"<td>word1</td><td>word2</td>" --> "word1word2" (should be "word1 word2")

PortalTransforms\transforms\html_to_text.py:

The problem lies in the current replacement expressions:
    return html_to_text("html_to_text",
                       ('<script [^>]>.*</script>(?im)', ''),
                       ('<style [^>]>.*</style>(?im)', ''),
                       ('<head [^>]>.*</head>(?im)', ''),
                       ('(?im)<(h[1-6r]|address|p|ul|ol|dl|pre|div|center|blockquote|form|isindex|table)(?=\W)[^>]*>', ' '),
                       ('<[^>]*>(?i)(?m)', ''),
                       )

As a workaround for indexing purposes, I added some more tags that should be replaced by a space (br, td, li). Of course, it would be better to replace a <br> with a newline and a <td> with |, for example.
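A minimal sketch of the workaround described above, for illustration only. It is not the PortalTransforms code: the function name html_to_text_sketch, the RULES table, and the exact tag list are assumptions. It replaces <br> with a newline, replaces structural tags (including td and li) with a space, and removes all remaining tags without inserting anything.

```python
import re

# Hypothetical rule table, loosely modelled on the expressions quoted above.
# Rules are applied in order; the tag list is an assumption, not exhaustive.
RULES = [
    # Drop script/style/head elements together with their content.
    (re.compile(r'<(script|style|head)\b[^>]*>.*?</\1\s*>', re.I | re.S), ' '),
    # A line break becomes a newline.
    (re.compile(r'<br\b[^>]*>', re.I), '\n'),
    # Structural tags (opening or closing) become a single space.
    (re.compile(
        r'</?(h[1-6]|hr|address|p|ul|ol|dl|li|dt|dd|pre|div|center|'
        r'blockquote|form|table|tr|td|th)\b[^>]*>', re.I), ' '),
    # Any remaining tag is removed without adding a separator.
    (re.compile(r'<[^>]*>'), ''),
]


def html_to_text_sketch(html):
    """Apply each (pattern, replacement) pair in turn and return the text."""
    for pattern, replacement in RULES:
        html = pattern.sub(replacement, html)
    return html


print(repr(html_to_text_sketch("word1<br>word2")))
print(html_to_text_sketch("<td>word1</td><td>word2</td>").split())
```

Regular expressions cannot handle every malformed document, so this is only meant to show the ordering of the rules: specific replacements first, the catch-all tag removal last.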

After fixing this, one has to rebuild the catalog in order to get contents indexed correctly. I think this should be fixed ASAP, as the bug makes content searches quite useless.

Change History

comment:1 Changed 6 years ago by wichert

  • Owner changed from alecm to wichert
  • Status changed from new to assigned

All elements removed should introduce a word boundary in the converted text, not just a select few.

comment:2 Changed 6 years ago by wichert

  • Status changed from assigned to closed
  • Resolution set to fixed

(In [9476]) Force a catalog reindex on upgrade to get correct word boundaries on transformed html content. fixes #5393

comment:3 Changed 6 years ago by thowe

  • Status changed from closed to reopened
  • Resolution fixed deleted

Not all removed HTML tags should result in a word boundary! Only the HTML tags expressing structural information should result in word separation.

In particular, tags that only affect formatting should be removed without introducing a word boundary:

Example where no word boun<b>dary</b> sh<strong>o</strong>uld be introduced.
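The rule above could be sketched with the standard library's HTML parser instead of regular expressions. This is only an illustration under stated assumptions: the names INLINE_TAGS, SKIPPED_TAGS, TextExtractor and extract_text are hypothetical, the set of inline tags is incomplete, and the code uses the html.parser module of current Python (in Python 2 the module was called HTMLParser).

```python
from html.parser import HTMLParser

# Assumed, incomplete set of purely formatting (inline) tags.
INLINE_TAGS = {"a", "abbr", "b", "big", "em", "font", "i", "small",
               "span", "strong", "sub", "sup", "tt", "u"}
# Elements whose text content should not be indexed at all.
SKIPPED_TAGS = {"script", "style"}


class TextExtractor(HTMLParser):
    """Collect text; every non-inline tag contributes a word boundary."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def _boundary(self, tag):
        if tag not in INLINE_TAGS:
            self.parts.append(" ")

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1
        self._boundary(tag)

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth:
            self._skip_depth -= 1
        self._boundary(tag)

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def extract_text(html):
    """Return the text of `html` with runs of whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


print(extract_text("no word boun<b>dary</b> sh<strong>o</strong>uld appear"))
print(extract_text("<td>word1</td><td>word2</td>"))
```

Treating every tag that is not known to be inline as structural is a deliberate choice in this sketch: an unknown tag then splits words rather than silently concatenating them, which is the safer failure mode for indexing.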

comment:4 Changed 6 years ago by shh

Isn't there some freakin' HTML splitter/stripper library around? This doesn't look like something we need to reinvent, does it?

comment:5 Changed 6 years ago by wichert

We also want to limit our external dependencies though.

comment:6 Changed 6 years ago by shh

Another comment (which just got eaten by effin' Trac!): Why are these conversions not done by the indexes? ZCTextIndex can use a htmltext_lexicon just fine. Our indexes use a "plone_lexicon" (for historical reasons?) which is not configured to HTML split. I don't know anything about PortalTransforms, unfortunately, but I find it hard to believe that HTML splitting/stripping has to be done by hand...

comment:7 Changed 6 years ago by wichert

While shh figures out how we can reuse htmltext_lexicon I updated the current PortalTransforms implementation to not introduce word boundaries on formatting-elements.

comment:8 Changed 6 years ago by hannosch

  • Milestone changed from 2.1.3 to 2.5

comment:9 Changed 6 years ago by wichert

  • Status changed from reopened to new
  • Owner changed from wichert to shh

comment:10 Changed 6 years ago by alecm

  • Milestone changed from 2.5 to 2.5.x

comment:11 Changed 6 years ago by alecm

  • Priority changed from major to minor
  • Type changed from defect to enhancement

Since wichert has fixed the core issue, and what is needed now is a prettier implementation, I'm calling this a minor enhancement.

comment:12 Changed 5 years ago by limi

  • Owner changed from shh to hazmat
  • Summary changed from PortalTransforms: html_to_text not splitting correctly to Use htmltext_lexicon instead of plone_lexicon?

The only person I know who has knowledge of the lexicon implementations is Kapil. Would you mind having a look, or reassigning it back to me if not? Thanks!

comment:13 Changed 4 years ago by hannosch

  • Milestone changed from 3.x to Future

comment:14 Changed 3 years ago by hannosch

  • Component changed from Content Types to Infrastructure