Ticket #9914 (closed Bug: worksforme)

Opened 4 years ago

Last modified 4 years ago

Strange Shortname on Japanese

Reported by: terapyon
Owned by: hannosch
Priority: blocker
Milestone: 4.0
Component: Internationalization
Version:
Keywords: shortname normalize Japanse
Cc: terada@…

Description


The plone.i18n.base.baseNormalize function, which is used to create URLs and shortnames, has been changed.

This modification is a big problem for Japanese. I think it should be reverted soon.

Detail

I think this was changed by #9532, which converts non-ASCII characters to ASCII using the "Unidecode" package. But we get very strange shortnames for Japanese.

I think the old style (as in Plone 3.x) is not the best, but it is the better choice. The new style (as in Plone 4.0a2) is bad and a big problem. Plone 4.0a is not usable for Japanese.

I think Unidecode is no good.

Suggestion

This modification should be reverted soon. Alternatively, we should have an option for this function.

Question

Why was this modified without a PLIP vote?

Change History

comment:1 Changed 4 years ago by papago

  • Priority changed from minor to blocker

[Follow up information]

On Plone 4, an object with a multi-byte character title is mapped to a strange ID (shortname) based on Chinese pronunciation. I think this causes a lot of confusion and conflict for multi-byte character users outside of China. In enterprise use especially, a web manager will not allow IDs based on a strange Chinese pronunciation on their website.

Unless Plone 4 can map a Unicode title to the pronunciation set for the given language setting, we should not use this mechanism. For the moment, we should stop using this function as the default for multi-byte titles and use the old Unicode shortname instead.

We should avoid mechanisms that cause this kind of conflict. This ticket should be a blocker for the Plone 4 release!!

comment:2 Changed 4 years ago by limi

Yeah, I was surprised to see this go in without a PLIP. Normalizing from non-latin languages is complicated, and unlikely to work unless you have native speakers that build it, and test it along the way.

My instinct would be to do as originally suggested — show the short name field on non-latin languages, and wait for Unicode support in URLs for Zope before we autogenerate anything.

comment:3 Changed 4 years ago by limi

  • Component changed from Unknown to Internationalization

comment:4 Changed 4 years ago by hannosch

The whole reason for adding this change early in the alpha process was to gather feedback from users about the quality of the transliteration. My hope was that it would be useful in general and produce better results than the former algorithm. Obviously, a static mapping based on individual characters cannot produce really good results.
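
To illustrate the limitation, here is a minimal sketch of a static per-character transliteration. It is not the Unidecode code: the three-entry table and the name `transliterate` are hypothetical, and the romanizations are only meant to resemble the Mandarin-style readings that a context-free table tends to assign to Han characters.

```python
# Hypothetical three-entry table for illustration only; a real table
# covers many thousands of characters.
CHAR_TABLE = {
    "日": "Ri",
    "本": "Ben",
    "語": "Yu",
}


def transliterate(text):
    """Replace each non-ASCII character on its own, ignoring language and context."""
    parts = []
    for ch in text:
        if ord(ch) < 128:
            # ASCII passes through unchanged.
            parts.append(ch)
        else:
            # One fixed romanization per character, followed by a space;
            # unknown characters become "?".
            parts.append(CHAR_TABLE.get(ch, "?") + " ")
    return "".join(parts).strip()


print(transliterate("日本語"))
```

A Japanese reader would expect something like "nihongo" for 日本語, but a per-character table cannot produce that, because the reading depends on the word and the language rather than on each character in isolation.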

If the new approach turns out to be problematic for specific languages (as seems to be the case for Japanese), we can disable it for these languages. We have a full infrastructure for language-dependent normalization in plone.i18n. This is currently used by only eight languages to provide different normalization rules, but it is easy to extend.
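
The idea of language-dependent normalization can be sketched as a lookup by language code with a fallback. This is only an illustration and does not use the plone.i18n API: `NORMALIZERS`, `normalize_for`, `default_normalize` and `hex_normalize` are hypothetical names, the language codes are examples, and the default shown here (Unicode decomposition from the standard library) is a stand-in for whatever general-purpose transliteration is in use.

```python
import unicodedata


def default_normalize(text):
    """Stand-in default: decompose accented letters, drop what is still non-ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def hex_normalize(text):
    """Language-specific rule: replace non-ASCII characters by their hex code point."""
    return "".join(ch if ord(ch) < 128 else "%x" % ord(ch) for ch in text)


# Hypothetical registry: languages listed here opt out of the default.
NORMALIZERS = {
    "ja": hex_normalize,
    "th": hex_normalize,
}


def normalize_for(language, text):
    """Use the language-specific normalizer if one is registered, else the default."""
    normalizer = NORMALIZERS.get(language, default_normalize)
    return normalizer(text)


print(normalize_for("fr", "café"))
print(normalize_for("ja", "日本"))
```

With this shape, disabling the general transliteration for one more language is a one-line addition to the registry.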

For Japanese the question is if there is something we can implement that is easy to do and has more information value than the former solution. Our former hex values of Unicode codepoints are as useful as a random string.
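
For reference, the kind of output the former approach gives can be sketched as follows. This is an assumption-laden approximation, not the actual baseNormalize code: `hex_normalize` is a hypothetical name, and the sketch only shows the replacement of non-ASCII characters by the hex value of their code point.

```python
def hex_normalize(text):
    """Keep ASCII characters; replace every other character by the hex of its code point."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)
        else:
            # For example U+65E5 becomes "65e5".
            out.append("%x" % ord(ch))
    return "".join(out)


print(hex_normalize("日本語"))
```

The result for 日本語 is "65e5672c8a9e": stable and collision-free, but it carries no meaning for a human reader.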

The original description of the code (http://search.cpan.org/~sburke/Text-Unidecode-0.04/lib/Text/Unidecode.pm) explicitly mentions Japanese and Thai as problematic, but says many other languages like Greek, Russian or Thaana work quite well. It certainly works well for the small number of non-ASCII characters found in Western languages.

And we still enable short names by default for all languages with a non-latin script. I think the ICU data we use to determine the script found in zope.i18n aren't all that complete, so we might miss languages. Maybe this could use a better approach. The code is in CMFPlone.setuphandlers.setupPortalContent following the "Enable visible_ids for non-latin scripts" comment.

comment:5 Changed 4 years ago by hannosch

  • Owner set to hannosch
  • Status changed from new to assigned

comment:6 Changed 4 years ago by hannosch

(In [32642]) Added new specific normalizer for Japanese, which avoids the Unidecode based transliteration. This refs #9914.

comment:7 Changed 4 years ago by hannosch

  • Status changed from assigned to closed
  • Resolution set to worksforme

I changed the normalization logic for Japanese and Thai back to the old algorithm.

comment:8 Changed 4 years ago by terapyon

(In [32840]) modify URL Normalize for Japanese. refs #9914

comment:9 Changed 4 years ago by terapyon

(In [32842]) modify a bit miss for URL Normalize for Japanese. refs #9914
