[darcs-users] [patch639] Utf-8 encoding for darcs send (reloaded)

Gabriel Kerneis kerneis at pps.jussieu.fr
Tue Jul 19 11:54:39 UTC 2011


Hi,

Before anything else: should I "rebase" (unrecord/record) my patch and make a
new submission?  I guess it is not necessary to keep track of every iteration
of this bundle, but the wiki suggests that follow-up patches are preferred, so
I'm a bit lost.

On Mon, Jul 18, 2011 at 04:01:43PM +0200, Florent Becker wrote:
> > this is a third attempt at using utf-8 whenever possible in darcs send.
> I don't understand the point of this bundle: what problems are solved by
> always assuming the mail content is utf-8?

The problem solved is that, currently, darcs always assumes the mail content is
ascii.

> Are we unable to reliably decode encoding,

Yes.  In fact, it might happen that there is no good encoding at all (for
instance, several commit messages recorded with different locales, merged in
the bundle list generated by darcs send).

> or is there a deeper problem? If it's just a problem of detecting the
> encoding, I'd rather have:
> 
> 
> > - if mail is made of ascii characters, send with content-type charset
> >   set to ascii and ignore locale completely,
> ok
> 
> > - if mail contains invalid utf-8 characters, propose to either abort
> >   (and save mail content in a file) or ignore the error and send with
> >   content-type charset set to utf-8 anyway (no support for other
> >   charset when sending),
> 
> ask the user to abort or input the encoding at that point (defaulting to
> utf-8 or best, locale)

Two issues here:
- I wouldn't trust the user to input a valid mime charset name interactively.
  Without checking rfc2046, would you choose iso8859-1 or iso-8859-1?  Or is it
  spelled iso8859-15?  Hmm, or maybe latin1?
- defaulting to locale means maintaining a correspondence table between the
  gazillions of locales out there and the corresponding mime charset.  I do not
  think such a thing belongs in darcs.  And most people use utf-8, or should
  ;-), or at least are able to convert to utf-8 when needed (according to
  Stephen, who knows much more about Japanese users than I do).
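
To illustrate the first issue, here is a minimal Python sketch (not darcs code;
darcs itself is written in Haskell).  The helper name canonical_codec is
hypothetical.  It relies only on Python's codec registry, whose names are not
the same thing as registered mime charset names, so take it as an illustration
of how many spellings one would have to accept or reject, not as a validator:

```python
import codecs

# Spellings a user might plausibly type when asked for "the latin1 charset".
candidates = ["iso8859-1", "iso-8859-1", "latin1"]


def canonical_codec(label):
    """Return Python's canonical codec name for a label, or None if unknown.

    Note: this is Python's own codec name, not a mime charset name.
    """
    try:
        return codecs.lookup(label).name
    except LookupError:
        return None


# All three spellings resolve to the same codec ...
resolved = {label: canonical_codec(label) for label in candidates}

# ... while a near-identical spelling is a different charset altogether,
# and a small typo is not recognised at all.
latin9 = canonical_codec("iso8859-15")
typo = canonical_codec("latin-one")

if __name__ == "__main__":
    for label, name in resolved.items():
        print(label, "->", name)
    print("iso8859-15 ->", latin9)
    print("latin-one ->", typo)
```

Even with such a lookup, darcs would still have to map the result back to the
exact spelling expected in a Content-Type header, which is the table I would
rather not maintain.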

Thinking about it, we might default to charset=unknown-8bit, as defined in
rfc1428.  This RFC states "This character set is not intended to be used by
mail composers." and "The use of the "unknown-8bit" label is intended only by
mail gateway agents which cannot determine via out-of-band information the
intended character set." but it is better to state that we do not know than to
state something obviously wrong.

Does someone see good reasons not to use unknown-8bit?  I'll switch to it
otherwise (and refactor things a bit in the meantime).
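
A minimal Python sketch of the selection rule discussed in this thread, with
the unknown-8bit fallback, could look like the following.  It is not the
actual darcs implementation; the function name guess_charset is hypothetical,
and I assume "us-ascii" as the mime spelling of ascii.  The interactive
"abort or send anyway" prompt is left out:

```python
def guess_charset(body: bytes) -> str:
    """Pick a Content-Type charset label for a mail body.

    - pure ascii       -> "us-ascii" (locale ignored)
    - valid utf-8      -> "utf-8"
    - anything else    -> "unknown-8bit" (rfc1428), i.e. "we do not know"
    """
    try:
        body.decode("ascii")
        return "us-ascii"
    except UnicodeDecodeError:
        pass
    try:
        # Strict decoding: any malformed sequence raises an error.
        body.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "unknown-8bit"


if __name__ == "__main__":
    samples = {
        "ascii only": b"Fix typo in README",
        "utf-8": "Corrige l'accent de \u00e9t\u00e9".encode("utf-8"),
        "latin-1": "Corrige l'accent de \u00e9t\u00e9".encode("latin-1"),
        # Two log messages recorded under different locales, concatenated.
        "mixed": "\u00e9".encode("utf-8") + b" / " + "\u00e9".encode("latin-1"),
    }
    for name, data in samples.items():
        print(name, "->", guess_charset(data))
```

The "mixed" sample is the case mentioned earlier where no single encoding is
right: the bundle is neither ascii nor valid utf-8, so the only honest label
is unknown-8bit.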

[We could also add a --charset flag to darcs send, and let people set it in
.darcs/defaults, at their own risk.  Yet another flag, though.]

> > - if mail is valid utf-8, send with content-type charset set to utf-8;
> >   additionally print a warning if current locale is not recognised as
> >   utf-8 (but do not propose to abort, assuming what looks like utf-8 is
> >   utf-8).
> > 
> ok if there still is a way to abort at that point by using ctrl-C, else
> the warning is just a pied-de-nez to the user, and that's undarcsish.

This is a potential issue, but I respectfully disagree: allowing the user to
abort would be feasible, but if the mail is valid utf-8, it is extremely likely
(say, 99.9%) that it is intended to be utf-8; when the locale differs, it is
most probably because the user knows darcs needs utf-8 formatting and saved the
content accordingly, and not because he or she has a name which happens to be a
valid utf-8 sequence when written in some random encoding.
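
To make the "coincidence" concrete, here is a small Python sketch (the helper
is_valid_utf8 and the sample names are made up for illustration; the 99.9%
figure above is my estimate and is not measured by this code).  Typical
accented text saved in a legacy 8-bit encoding is rejected by a strict utf-8
decoder, and it takes a rather contrived string to pass by accident:

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict utf-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# An ordinary accented name saved as latin-1: the bytes 0xE9 and 0xFC are
# not followed by valid continuation bytes, so this is not valid utf-8.
typical = "Fr\u00e9d\u00e9ric M\u00fcller".encode("latin-1")

# A contrived latin-1 string, "A-tilde" followed by "copyright sign",
# whose bytes 0xC3 0xA9 happen to form the utf-8 encoding of e-acute.
coincidence = "\u00c3\u00a9".encode("latin-1")

if __name__ == "__main__":
    print(is_valid_utf8(typical))
    print(is_valid_utf8(coincidence), coincidence.decode("utf-8"))
```

The second case is the only one where guessing utf-8 would be wrong, and it
requires text that nobody is likely to write on purpose in a latin-1 locale.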

Which is more undarcsish: bothering >99.9% of users using a non-utf-8 locale
with yet another extra confirmation prompt, even though they took care of
tweaking their setup to save the message as utf-8 [*], or assuming everything
will be okay and letting the very few users in the world affected by the issue
resend their patches with a properly encoded message?

Such users shouldn't be bitten more than once, and then again the integrity of
the patch is preserved.  We are only talking about a harmless mistake in the
message encoding, which currently is *always* ascii and nobody seems to have
shouted about it before me.

(I really expect this coincidence to never happen.  Even in Japanese.  But I
know too little about iso-2022 and Japanese users in general, that's why I
decided to trust Stephen's advice and print a warning.)

[*] on the other hand, one might expect such users to change their locale too.
    But Stephen seemed to imply that this is not always possible, for reasons
    that I gave up trying to understand.

Best,
-- 
Gabriel

