Ticket #5393 (new Feature Request)
Use htmltext_lexicon instead of plone_lexicon?
| Reported by: | thowe | Owned by: | hazmat |
|---|---|---|---|
| Priority: | minor | Milestone: | Future |
| Component: | Infrastructure | Keywords: | PortalTransforms html_to_text catalog index splitting search |
| Cc: | tw_switzerland@… |
Description
I found that some of our texts were not searchable in Plone because some keywords appeared to be missing from the catalog.
Inspection showed that words separated only by certain HTML tags made it into the catalog only in concatenated form. My search for the cause ended at the "html_to_text.py" transform file inside the PortalTransforms product.
In this file, HTML text is not converted correctly into text/plain. Some examples:
"word1<br>word2" --> "word1word2" (should be "word1\nword2")
"<td>word1</td><td>word2</td>" --> "word1word2" (should be "word1 word2")
PortalTransforms\transforms\html_to_text.py:
The problem lies in the current replacement expressions:
return html_to_text("html_to_text",
    ('<script [^>]>.*</script>(?im)', ''),
    ('<style [^>]>.*</style>(?im)', ''),
    ('<head [^>]>.*</head>(?im)', ''),
    ('(?im)<(h[1-6r]|address|p|ul|ol|dl|pre|div|center|blockquote|form|isindex|table)(?=\W)[^>]*>', ' '),
    ('<[^>]*>(?i)(?m)', ''),
    )
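For illustration, the effect of the last two rules can be reproduced outside Plone with a small sketch. This is not the PortalTransforms class itself: `strip_tags` and `REPLACEMENTS` are hypothetical names, the script/style/head rules are left out, and the flags are passed to `re.compile` because recent Python versions reject inline flags placed at the end of a pattern.

```python
import re

# Flags passed explicitly instead of the trailing "(?im)" in the ticket's patterns.
FLAGS = re.IGNORECASE | re.MULTILINE

# Tags that the current rules turn into a space (copied from the ticket).
BLOCK_TAGS = (r'h[1-6r]|address|p|ul|ol|dl|pre|div|center|blockquote|'
              r'form|isindex|table')

# Hypothetical stand-in for the (pattern, replacement) pairs of the transform.
REPLACEMENTS = [
    (re.compile(r'<(%s)(?=\W)[^>]*>' % BLOCK_TAGS, FLAGS), ' '),
    (re.compile(r'<[^>]*>', FLAGS), ''),  # every other tag is dropped silently
]


def strip_tags(html):
    """Apply each replacement in order, as the transform is assumed to do."""
    for pattern, replacement in REPLACEMENTS:
        html = pattern.sub(replacement, html)
    return html


print(repr(strip_tags('word1<br>word2')))                # 'word1word2'
print(repr(strip_tags('<td>word1</td><td>word2</td>')))  # 'word1word2'
print(repr(strip_tags('<p>word1</p><p>word2</p>')))      # ' word1 word2'
```

Only tags in the first rule produce a separator; `<br>` and `<td>` fall through to the catch-all rule and vanish without one.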
As a workaround for indexing purposes, I added some more tags that should be replaced by a space (br, td, li). Of course it would be better to replace a <br> by a newline and a <td> by | for example.
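A rough sketch of that workaround, assuming the rules are applied in order as plain regex substitutions (`strip_tags_workaround` and `WORKAROUND_RULES` are made-up names, and the script/style/head rules are omitted for brevity):

```python
import re

FLAGS = re.IGNORECASE | re.MULTILINE

# Original tag list from the transform, extended with br, td and li.
SPACE_TAGS = (r'h[1-6r]|address|p|ul|ol|dl|pre|div|center|blockquote|'
              r'form|isindex|table|br|td|li')

WORKAROUND_RULES = [
    (re.compile(r'<(%s)(?=\W)[^>]*>' % SPACE_TAGS, FLAGS), ' '),
    (re.compile(r'<[^>]*>', FLAGS), ''),  # remaining tags are still dropped
]


def strip_tags_workaround(html):
    """Replace the listed opening tags with a space, then drop all other tags."""
    for pattern, replacement in WORKAROUND_RULES:
        html = pattern.sub(replacement, html)
    return html


print(repr(strip_tags_workaround('word1<br>word2')))                # 'word1 word2'
print(repr(strip_tags_workaround('<td>word1</td><td>word2</td>')))  # ' word1 word2'
```

The `(?=\W)` lookahead keeps short names from matching longer ones, so `li` does not swallow `<link ...>`.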
After fixing this, one has to rebuild the catalog in order to get contents indexed correctly. I think this should be fixed ASAP, as the bug makes content searches quite useless.
Change History
comment:1 Changed 6 years ago by wichert
- Owner changed from alecm to wichert
- Status changed from new to assigned
comment:2 Changed 6 years ago by wichert
- Status changed from assigned to closed
- Resolution set to fixed
comment:3 Changed 6 years ago by thowe
- Status changed from closed to reopened
- Resolution fixed deleted
Not all removed HTML tags should result in a word boundary! Only the HTML tags expressing structural information should result in word separation.
In particular, purely formatting tags should be removed without introducing a word boundary:
Example where no word boun<b>dary</b> sh<strong>o</strong>uld be introduced.
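One way to express this distinction is sketched below. The list of formatting tags is an assumption for illustration, not the list used by PortalTransforms, and `html_to_words` is a hypothetical helper.

```python
import re

# Assumed set of purely presentational tags; adjust as needed.
INLINE_TAGS = r'b|i|u|em|strong|font|span|small|big|sub|sup|tt'

# Opening or closing formatting tag, e.g. <b>, </strong>, <span class="x">.
INLINE_RE = re.compile(r'</?(%s)(?=[\s/>])[^>]*>' % INLINE_TAGS, re.IGNORECASE)
ANY_TAG_RE = re.compile(r'<[^>]*>')


def html_to_words(html):
    """Drop formatting tags silently, turn every other tag into a space."""
    text = INLINE_RE.sub('', html)     # no word boundary here
    text = ANY_TAG_RE.sub(' ', text)   # structural tags separate words
    return ' '.join(text.split())      # collapse runs of whitespace


print(html_to_words('no word boun<b>dary</b> sh<strong>o</strong>uld'))
# no word boundary should
print(html_to_words('<td>word1</td><td>word2</td>'))
# word1 word2
```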
comment:4 Changed 6 years ago by shh
Isn't there some freakin' HTML splitter/stripper library around? This doesn't look like something we need to reinvent, does it?
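As one possibility, the parser in the Python standard library can do the stripping without hand-written tag regexes. The sketch below assumes Python 3's `html.parser`; the `TextExtractor` class, the `extract_text` helper and the set of formatting tags are invented for illustration and are not part of PortalTransforms or ZCTextIndex.

```python
from html.parser import HTMLParser

# Assumed set of formatting tags that must not split words.
INLINE_TAGS = {'b', 'i', 'u', 'em', 'strong', 'font', 'span',
               'small', 'big', 'sub', 'sup', 'tt'}
# Elements whose content should not be indexed at all.
SKIPPED = {'script', 'style'}


class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED:
            self.skip_depth += 1
        elif tag not in INLINE_TAGS:
            self.parts.append(' ')  # structural tag: word boundary

    def handle_endtag(self, tag):
        if tag in SKIPPED:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag not in INLINE_TAGS:
            self.parts.append(' ')

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return ' '.join(''.join(parser.parts).split())


print(extract_text('word1<br>word2'))                           # word1 word2
print(extract_text('boun<b>dary</b> sh<strong>o</strong>uld'))  # boundary should
```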
comment:6 Changed 6 years ago by shh
Another comment (which just got eaten by effin' Trac!): Why are these conversions not done by the indexes? ZCTextIndex can use a htmltext_lexicon just fine. Our indexes use a "plone_lexicon" (for historical reasons?) which is not configured to HTML split. I don't know anything about PortalTransforms, unfortunately, but I find it hard to believe that HTML splitting/stripping has to be done by hand...
comment:7 Changed 6 years ago by wichert
While shh figures out how we can reuse htmltext_lexicon I updated the current PortalTransforms implementation to not introduce word boundaries on formatting-elements.
comment:9 Changed 6 years ago by wichert
- Status changed from reopened to new
- Owner changed from wichert to shh
comment:11 Changed 6 years ago by alecm
- Priority changed from major to minor
- Type changed from defect to enhancement
Since wichert has fixed the core issue, and what is needed now is a prettier implementation, I'm calling this a minor enhancement.
comment:12 Changed 5 years ago by limi
- Owner changed from shh to hazmat
- Summary changed from PortalTransforms: html_to_text not splitting correctly to Use htmltext_lexicon instead of plone_lexicon?
The only person I know of who has knowledge about the lexicon implementations is Kapil. Would you mind having a look, or reassigning it back to me if not? Thanks!

All elements removed should introduce a word boundary in the converted text, not just a select few.