[darcs-users] Latin vs. Unicode
Ben Franksen
ben.franksen at online.de
Mon Nov 17 21:44:32 UTC 2014
Stephen J. Turnbull wrote:
> Ben Franksen writes:
> > Over the last years, unicode has established itself world-wide and
> > firmly and is well supported by all the major operating systems. This
> > is why I vote for dropping support for older 8-bit encodings that are
> > not unicode compatible, thereby allowing e.g. Chinese users to use
> > Darcs with their native languages.
>
> Does "just dropping 8-bit support" actually enable that, or does it
> only work in a .UTF8 locale?
I am sorry, I should have replied to myself earlier:
Contrary to what I said, it is not actually necessary to drop support for
non-Unicode encodings -- in principle. All we need to require is a lossless
conversion from the text (which we assume is encoded in the current locale)
to Unicode for input, and the reverse for output.
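A minimal sketch of what such a conversion could look like in GHC Haskell,
just to make the idea concrete (the helper names decodeWith, encodeWith and
roundTrip are mine, not anything in Darcs):

import qualified Data.ByteString as B
import Foreign.Marshal.Alloc (free)
import GHC.Foreign (newCStringLen, peekCStringLen)
import GHC.IO.Encoding (TextEncoding, getLocaleEncoding)

-- Decode raw bytes, assumed to be in the given encoding (e.g. the
-- current locale), to a proper Unicode String.
decodeWith :: TextEncoding -> B.ByteString -> IO String
decodeWith enc bytes = B.useAsCStringLen bytes (peekCStringLen enc)

-- Encode a Unicode String back to bytes in that encoding.
encodeWith :: TextEncoding -> String -> IO B.ByteString
encodeWith enc str = do
  cstr@(ptr, _) <- newCStringLen enc str
  bytes <- B.packCStringLen cstr
  free ptr
  return bytes

-- Round-trip some bytes through the current locale.
roundTrip :: B.ByteString -> IO B.ByteString
roundTrip bytes = do
  enc <- getLocaleEncoding
  str <- decodeWith enc bytes
  encodeWith enc str

With GHC's standard TextEncodings the conversion fails on invalid input
instead of silently corrupting it, which matches the lossless requirement
above; whether one of the //TRANSLIT or //ROUNDTRIP variants from
mkTextEncoding would suit Darcs better is a separate question.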
Ganesh explained to me why implementing this cleanly would mean a lot of
effort. The reason is that for a long time Darcs had to work around broken
standard IO libraries that did not consider encodings at all and simply
assumed a 1:1 correspondence between Char and byte.
> Or does it even work at all? I have
> trouble imagining how a random 8 bit encoding would get passed in
> verbatim to a widechar Unicode string, which can then be cast to an
> 8-bit encoding that actually comes out the way it went in.
This is exactly what Darcs is currently trying to do and what I want to get
rid of. It breaks, of course, as soon as your text translates to code points
outside of the 8-bit range, which western Europeans tend not to notice since
their characters mostly lie inside that range.
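To make the breakage concrete, a toy example of my own (not Darcs code):
GHC's char8 encoding writes each code point modulo 256, so anything above
U+00FF is mangled on output.

import System.IO (char8, hSetEncoding, stdout)

main :: IO ()
main = do
  hSetEncoding stdout char8
  -- "\x4E2D" is the Chinese character U+4E2D. Under char8 it is
  -- written as its code point modulo 256, i.e. the single byte 0x2D,
  -- which is "-". The original character is irrecoverably lost.
  putStrLn "\x4E2D"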
> 8-bit
> encodings (including Latin-1) must be recoded to Unicode, or they
> probably violate the UTF-8 format
Sure. But since at least ghc-7.4 the IO libraries have been fixed and
correctly encode and decode according to the current locale. So we already
get everything the user enters properly decoded to String.
Except that Darcs, in order not to break all the code built on those
previously necessary work-arounds, forces the IO libraries into a
compatibility mode: it pretends the user has a "char8" encoding, which means
"convert every byte verbatim to Char without trying to decode or encode
anything".
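Roughly speaking (my own sketch of the idea, not the actual Darcs source),
that compatibility mode amounts to the first function below, while the
locale-aware behaviour GHC gives you by default since 7.4 corresponds to the
second:

import System.IO (char8, hSetEncoding, localeEncoding, stderr, stdin, stdout)

-- The work-around: force a 1:1 byte <-> Char mapping on the standard
-- handles, bypassing any real decoding or encoding.
forceByteMode :: IO ()
forceByteMode = mapM_ (`hSetEncoding` char8) [stdin, stdout, stderr]

-- The locale-aware alternative: text on the standard handles goes
-- through the current locale's encoding (GHC's default since 7.4).
useLocaleMode :: IO ()
useLocaleMode = mapM_ (`hSetEncoding` localeEncoding) [stdin, stdout, stderr]

(GHC.IO.Encoding also offers setLocaleEncoding and setFileSystemEncoding, so
the same choice exists globally for handles opened later and for file names;
I am simplifying here.)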
> (eg, the sequence ASCII-characters
> latin-1-character ASCII-character can never be valid UTF-8, but it's
> extremely common in Latin-1 text).
>
> Nor do I think you can count on command lines having a .UTF-8 locale.
> Shift JIS and to some extent EUC-JP remain popular in Japan, and at
> least my Chinese students frequently use Big5 and the GB family or
> encodings. All of these have repertoires that are Unicode subsets,
> but the encodings are different. Users expect to be able to "cat"
> them to the terminal and read them, and for that use case they will
> have a locale that specifies a default charset other than UTF-8. Most
> terminals are not able to switch encodings on the fly, so this can be
> extremely inconvenient.
We are in violent agreement here.
> I'm not saying it's not worth doing, but be prepared for quite a bit
> more work than "just dropping 8-bit support."
You are most certainly right that fixing the encoding stuff in Darcs
properly will be a lot of work.
Cheers
Ben
--
"Make it so they have to reboot after every typo." -- Scott Adams