From vinschen at redhat.com Tue Feb 9 16:02:12 2010
From: vinschen at redhat.com (Corinna Vinschen)
Date: Tue, 9 Feb 2010 15:02:12 +0100
Subject: Correct handling of wide chars on systems with sizeof(wchar_t)==2?
In-Reply-To: <20091227163246.A40C15654F@rebar.astron.com>
References: <20091227124036.GD27407@calimero.vinschen.de>
 <20091227163246.A40C15654F@rebar.astron.com>
Message-ID: <20100209140212.GY28659@calimero.vinschen.de>

Hi Christos,

On Dec 27 11:32, Christos Zoulas wrote:
> On Dec 27, 1:40pm, vinschen at redhat.com (Corinna Vinschen) wrote:
> -- Subject: Re: Correct handling of wide chars on systems with sizeof(wchar_t
>
> | > Yes, we should detect another case where sizeof(wchar_t) == 2 and then
> | > make Char int32_t, and deal with converting back and forth between Char
> | > and wchar_t the same way we used to convert when using short string
> | > (between Char (int16_t) and char).
> |
> | Do we really need another case? Wouldn't it be sufficient to redefine
> | SHORT_STRINGS to be that case? It might be helpful to define Char as
> | wint_t in this case. wint_t is typically defined as 4 byte unsigned on
> | a UTF-16 system anyway, otherwise there's no way to define WEOF.
>
> It might be simpler, I don't know.

Below you find the patch I applied to tcsh to make WIDE_STRINGS work
for systems with sizeof (wchar_t) == 2.

It was rather simple, actually. The groundwork laid by the SHORT_STRINGS
stuff was usable almost without changes for UTF-16 systems. We now have
a fourth case, which is UTF16_STRINGS. It's only defined if WIDE_STRINGS
is defined as well, and then only if sizeof(wchar_t) < 4. It shares
most of the code with WIDE_STRINGS and SHORT_STRINGS, just a few places
needed a bit of tweaking. Since Char is wint_t, not wchar_t, some
places just need a cast. Two exceptions:

- rt_mbtowc needs a temporary wchar_t to store the actual wide char
  returned by mbtowc to make this code endianness-agnostic.

- In s_strcasecmp I changed the type of the temporary variables l1 and
  l2 to wint_t unconditionally. The reason is that this should always be
  the right thing to do, given that towlower returns a wint_t anyway.

I'm not exactly sure that this will work in UTF-16 cases where surrogate
pairs are affected, but at least for the base plane it should work fine.
If somebody stumbles over a problem with surrogate pairs, I'd be glad
to fix it.

Would you mind checking this patch in?


Thanks,
Corinna


    * config_f.h (WIDE_STRINGS): Define independently of the size of
    wchar_t. Remove #error if sizeof (wchar_t) < 4.
    (UTF16_STRINGS): Define if sizeof (wchar_t) < 4.
    * sh.file.c (compare): Disable WIDE_STRINGS code for UTF16_STRING
    targets.
    * sh.h (Char): Define as wint_t on UTF16_STRING targets.
    Disable Str functions to wcs functions mapping on UTF16_STRING
    targets.
    * tc.decls.h (one_mbtowc): Declare with Char instead of wchar_t.
    (one_wctomb): Ditto.
    (rt_mbtowc): Ditto.
    * tc.nls.c (NLSWidth): Cast Char to wchar_t in call to wcwidth.
    (NLSStringWidth): Ditto in call to wcswidth.
    * tc.str.c (one_mbtowc): Define with Char instead of wchar_t.
    (one_wctomb): Ditto.
    (rt_mbtowc): Ditto. Accommodate different size of wchar_t vs. wint_t
    on UTF16_STRINGS targets.
    Enable s_XXX string functions on UTF16_STRINGS targets.

Index: config_f.h =================================================================== RCS file: /p/tcsh/cvsroot/tcsh/config_f.h,v retrieving revision 3.42 diff -u -p -r3.42 config_f.h --- config_f.h 25 Jun 2009 12:10:56 -0000 3.42 +++ config_f.h 9 Feb 2010 13:55:21 -0000 @@ -50,8 +50,11 @@ * WIDE_STRINGS Represent strings using wide characters * Allows proper function in multibyte encodings like UTF-8 */ -#if defined (SHORT_STRINGS) && defined (NLS) && SIZEOF_WCHAR_T >= 4 && defined (HAVE_MBRTOWC) && !defined (WINNT_NATIVE) && !defined(_OSD_POSIX) +#if defined (SHORT_STRINGS) && defined (NLS) && defined (HAVE_MBRTOWC) && !defined (WINNT_NATIVE) && !defined(_OSD_POSIX) # define WIDE_STRINGS +# if SIZEOF_WCHAR_T < 4 +# define UTF16_STRINGS +# endif #endif /* @@ -197,10 +200,6 @@ /* Consistency checks */ #ifdef WIDE_STRINGS -# if SIZEOF_WCHAR_T < 4 - #error "wchar_t must be at least 4 bytes for WIDE_STRINGS" -# endif - # ifdef WINNT_NATIVE #error "WIDE_STRINGS cannot be used together with WINNT_NATIVE" # endif Index: sh.file.c =================================================================== RCS file: /p/tcsh/cvsroot/tcsh/sh.file.c,v retrieving revision 3.36 diff -u -p -r3.36 sh.file.c --- sh.file.c 5 Jul 2007 14:13:06 -0000 3.36 +++ sh.file.c 9 Feb 2010 13:55:21 -0000 @@ -594,7 +594,7 @@ again: /* search for matches */ static int compare(const void *p, const void *q) { -#ifdef WIDE_STRINGS +#if defined (WIDE_STRINGS) && !defined (UTF16_STRING) errno = 0; return (wcscoll(*(Char *const *) p, *(Char *const *) q)); Index: sh.h =================================================================== RCS file: /p/tcsh/cvsroot/tcsh/sh.h,v retrieving revision 3.155 diff -u -p -r3.155 sh.h --- sh.h 26 Jan 2010 20:03:17 -0000 3.155 +++ sh.h 9 Feb 2010 13:55:21 -0000 @@ -89,7 +89,11 @@ typedef unsigned long intptr_t; #ifdef SHORT_STRINGS # ifdef WIDE_STRINGS #include +# ifdef UTF16_STRINGS +typedef wint_t Char; +# else typedef wchar_t Char; +#endif typedef unsigned long uChar; 
typedef wint_t eChar; /* Can contain any Char value or CHAR_ERR */ #define CHAR_ERR WEOF /* Pretty please, use bit 31... */ @@ -1099,7 +1103,7 @@ EXTERN Char PRCHROOT; /* Prompt symbo #define short2blk(a) saveblk(a) #define short2str(a) caching_strip(a) #else -#ifdef WIDE_STRINGS +#ifndef UTF16_STRINGS #define Strchr(a, b) wcschr(a, b) #define Strrchr(a, b) wcsrchr(a, b) #define Strcat(a, b) wcscat(a, b) Index: tc.decls.h =================================================================== RCS file: /p/tcsh/cvsroot/tcsh/tc.decls.h,v retrieving revision 3.64 diff -u -p -r3.64 tc.decls.h --- tc.decls.h 14 May 2008 20:10:30 -0000 3.64 +++ tc.decls.h 9 Feb 2010 13:55:21 -0000 @@ -259,9 +259,9 @@ extern void sched_run (void); * tc.str.c: */ #ifdef WIDE_STRINGS -extern size_t one_mbtowc (wchar_t *, const char *, size_t); -extern size_t one_wctomb (char *, wchar_t); -extern int rt_mbtowc (wchar_t *, const char *, size_t); +extern size_t one_mbtowc (Char *, const char *, size_t); +extern size_t one_wctomb (char *, Char); +extern int rt_mbtowc (Char *, const char *, size_t); #else #define one_mbtowc(PWC, S, N) \ ((void)(N), *(PWC) = (unsigned char)*(S), (size_t)1) Index: tc.nls.c =================================================================== RCS file: /p/tcsh/cvsroot/tcsh/tc.nls.c,v retrieving revision 3.21 diff -u -p -r3.21 tc.nls.c --- tc.nls.c 26 Sep 2006 16:45:30 -0000 3.21 +++ tc.nls.c 9 Feb 2010 13:55:21 -0000 @@ -42,7 +42,7 @@ NLSWidth(Char c) int l; if (c & INVALID_BYTE) return 1; - l = wcwidth(c); + l = wcwidth((wchar_t) c); return l >= 0 ? 
l : 0; # else return iswprint(c) != 0; @@ -58,7 +58,7 @@ NLSStringWidth(const Char *s) while (*s) { c = *s++; #ifdef HAVE_WCWIDTH - if ((l = wcwidth(c)) < 0) + if ((l = wcwidth((wchar_t) c)) < 0) l = 2; #else l = iswprint(c) != 0; Index: tc.str.c =================================================================== RCS file: /p/tcsh/cvsroot/tcsh/tc.str.c,v retrieving revision 3.30 diff -u -p -r3.30 tc.str.c --- tc.str.c 25 Jun 2009 21:27:38 -0000 3.30 +++ tc.str.c 9 Feb 2010 13:55:21 -0000 @@ -46,7 +46,7 @@ RCSID("$tcsh: tc.str.c,v 3.30 2009/06/25 #ifdef WIDE_STRINGS size_t -one_mbtowc(wchar_t *pwc, const char *s, size_t n) +one_mbtowc(Char *pwc, const char *s, size_t n) { int len; @@ -61,7 +61,7 @@ one_mbtowc(wchar_t *pwc, const char *s, } size_t -one_wctomb(char *s, wchar_t wchar) +one_wctomb(char *s, Char wchar) { int len; @@ -69,7 +69,7 @@ one_wctomb(char *s, wchar_t wchar) s[0] = wchar & 0xFF; len = 1; } else { - len = wctomb(s, wchar); + len = wctomb(s, (wchar_t) wchar); if (len == -1) s[0] = wchar; if (len <= 0) @@ -79,14 +79,24 @@ one_wctomb(char *s, wchar_t wchar) } int -rt_mbtowc(wchar_t *pwc, const char *s, size_t n) +rt_mbtowc(Char *pwc, const char *s, size_t n) { int ret; char back[MB_LEN_MAX]; +#ifdef UTF16_STRINGS + wchar_t tmp; + ret = mbtowc(&tmp, s, n); +#else ret = mbtowc(pwc, s, n); - if (ret > 0 && (wctomb(back, *pwc) != ret || memcmp(s, back, ret) != 0)) - ret = -1; +#endif + if (ret > 0) { +#ifdef UTF16_STRINGS + *pwc = tmp; +#endif + if (wctomb(back, *pwc) != ret || memcmp(s, back, ret) != 0) + ret = -1; + } return ret; } #endif @@ -186,7 +196,7 @@ short2str(const Char *src) return (sdst); } -#ifndef WIDE_STRINGS +#if !defined (WIDE_STRINGS) || defined (UTF16_STRINGS) Char * s_strcpy(Char *dst, const Char *src) { @@ -334,7 +344,7 @@ int s_strcasecmp(const Char *str1, const Char *str2) { #ifdef WIDE_STRINGS - wchar_t l1 = 0, l2 = 0; + wint_t l1 = 0, l2 = 0; for (; *str1 && ((*str1 == *str2 && (l1 = l2 = 0) == 0) || (l1 = towlower(*str1)) == (l2 
= towlower(*str2))); str1++, str2++)
         continue;

-- 
Corinna Vinschen
Cygwin Project Co-Leader
Red Hat

From christos at zoulas.com Tue Feb 9 22:18:50 2010
From: christos at zoulas.com (Christos Zoulas)
Date: Tue, 9 Feb 2010 15:18:50 -0500
Subject: Correct handling of wide chars on systems with sizeof(wchar_t)==2?
In-Reply-To: <20100209140212.GY28659@calimero.vinschen.de> from Corinna
 Vinschen (Feb 9, 3:02pm)
Message-ID: <20100209201850.56CF95654E@rebar.astron.com>

On Feb 9, 3:02pm, vinschen at redhat.com (Corinna Vinschen) wrote:
-- Subject: Re: Correct handling of wide chars on systems with sizeof(wchar_t

| Hi Christos,
|
| On Dec 27 11:32, Christos Zoulas wrote:
| > On Dec 27, 1:40pm, vinschen at redhat.com (Corinna Vinschen) wrote:
| > -- Subject: Re: Correct handling of wide chars on systems with sizeof(wchar_t
| >
| > | > Yes, we should detect another case where sizeof(wchar_t) == 2 and then
| > | > make Char int32_t, and deal with converting back and forth between Char
| > | > and wchar_t the same way we used to convert when using short string
| > | > (between Char (int16_t) and char).
| > |
| > | Do we really need another case? Wouldn't it be sufficient to redefine
| > | SHORT_STRINGS to be that case? It might be helpful to define Char as
| > | wint_t in this case. wint_t is typically defined as 4 byte unsigned on
| > | a UTF-16 system anyway, otherwise there's no way to define WEOF.
| >
| > It might be simpler, I don't know.
|
| Below you find the patch I applied to tcsh to make WIDE_STRINGS work
| for systems with sizeof (wchar_t) == 2.
|
| It was rather simple, actually. The groundwork laid by the SHORT_STRINGS
| stuff was usable almost without changes for UTF-16 systems. We now have
| a fourth case, which is UTF16_STRINGS. It's only defined if WIDE_STRINGS
| is defined as well, and then only if sizeof(wchar_t) < 4. It shares
| most of the code with WIDE_STRINGS and SHORT_STRINGS, just a few places
| needed a bit of tweaking. Since Char is wint_t, not wchar_t, some
| places just need a cast. Two exceptions:
|
| - rt_mbtowc needs a temporary wchar_t to store the actual wide char
|   returned by mbtowc to make this code endianness-agnostic.
|
| - In s_strcasecmp I changed the type of the temporary variables l1 and
|   l2 to wint_t unconditionally. The reason is that this should always be
|   the right thing to do, given that towlower returns a wint_t anyway.
|
| I'm not exactly sure that this will work in UTF-16 cases where surrogate
| pairs are affected, but at least for the base plane it should work fine.
| If somebody stumbles over a problem with surrogate pairs, I'd be glad
| to fix it.
|
| Would you mind checking this patch in?
|

Not at all! Very nicely done, and I am really glad it was that simple!

Best,

christos

From vinschen at redhat.com Wed Feb 10 11:08:38 2010
From: vinschen at redhat.com (Corinna Vinschen)
Date: Wed, 10 Feb 2010 10:08:38 +0100
Subject: Correct handling of wide chars on systems with sizeof(wchar_t)==2?
In-Reply-To: <20100209201850.56CF95654E@rebar.astron.com>
References: <20100209140212.GY28659@calimero.vinschen.de>
 <20100209201850.56CF95654E@rebar.astron.com>
Message-ID: <20100210090838.GA28659@calimero.vinschen.de>

On Feb 9 15:18, Christos Zoulas wrote:
> On Feb 9, 3:02pm, vinschen at redhat.com (Corinna Vinschen) wrote:
> | Below you find the patch I applied to tcsh to make WIDE_STRINGS work
> | for systems with sizeof (wchar_t) == 2.
> | [...]
> | - rt_mbtowc needs a temporary wchar_t to store the actual wide char
> |   returned by mbtowc to make this code endianness-agnostic.
> | [...]
> | Would you mind checking this patch in?
>
> Not at all! Very nicely done, and I am really glad it was that simple!

Thanks!

Come to think of it, I'm wondering if it's really necessary to
special-case UTF16_STRINGS in rt_mbtowc.
This code

int
rt_mbtowc(Char *pwc, const char *s, size_t n)
{
    int ret;
    char back[MB_LEN_MAX];
    wchar_t tmp;

    ret = mbtowc(&tmp, s, n);
    if (ret > 0) {
        *pwc = tmp;
        if (wctomb(back, *pwc) != ret || memcmp(s, back, ret) != 0)
            ret = -1;
    }
    return ret;
}

is as correct on sizeof(wchar_t) == 4 systems as it is on
sizeof(wchar_t) == 2 systems. Therefore I'd like to propose removing
the #ifdef's, since every #ifdef is just a source of conditional
confusion.


Index: tc.str.c
===================================================================
RCS file: /p/tcsh/cvsroot/tcsh/tc.str.c,v
retrieving revision 3.31
diff -u -p -r3.31 tc.str.c
--- tc.str.c 9 Feb 2010 20:20:09 -0000 3.31
+++ tc.str.c 10 Feb 2010 08:55:15 -0000
@@ -83,17 +83,11 @@ rt_mbtowc(Char *pwc, const char *s, size
 {
     int ret;
     char back[MB_LEN_MAX];
-#ifdef UTF16_STRINGS
     wchar_t tmp;
 
     ret = mbtowc(&tmp, s, n);
-#else
-    ret = mbtowc(pwc, s, n);
-#endif
     if (ret > 0) {
-#ifdef UTF16_STRINGS
         *pwc = tmp;
-#endif
         if (wctomb(back, *pwc) != ret || memcmp(s, back, ret) != 0)
             ret = -1;
     }


Corinna

-- 
Corinna Vinschen
Cygwin Project Co-Leader
Red Hat

From christos at zoulas.com Wed Feb 10 15:30:28 2010
From: christos at zoulas.com (Christos Zoulas)
Date: Wed, 10 Feb 2010 08:30:28 -0500
Subject: Correct handling of wide chars on systems with sizeof(wchar_t)==2?
In-Reply-To: <20100210090838.GA28659@calimero.vinschen.de> from Corinna
 Vinschen (Feb 10, 10:08am)
Message-ID: <20100210133028.9F5B65654F@rebar.astron.com>

On Feb 10, 10:08am, vinschen at redhat.com (Corinna Vinschen) wrote:
-- Subject: Re: Correct handling of wide chars on systems with sizeof(wchar_t

| Come to think of it, I'm wondering if it's really necessary to
| special-case UTF16_STRINGS in rt_mbtowc. This code
|
| int
| rt_mbtowc(Char *pwc, const char *s, size_t n)
| {
|     int ret;
|     char back[MB_LEN_MAX];
|     wchar_t tmp;
|
|     ret = mbtowc(&tmp, s, n);
|     if (ret > 0) {
|         *pwc = tmp;
|         if (wctomb(back, *pwc) != ret || memcmp(s, back, ret) != 0)
|             ret = -1;
|     }
|     return ret;
| }
|
| is as correct on sizeof(wchar_t) == 4 systems as it is on
| sizeof(wchar_t) == 2 systems. Therefore I'd like to propose removing
| the #ifdef's, since every #ifdef is just a source of conditional
| confusion.

Committed, thanks!

christos

From vinschen at redhat.com Sat Feb 13 00:08:48 2010
From: vinschen at redhat.com (Corinna Vinschen)
Date: Fri, 12 Feb 2010 23:08:48 +0100
Subject: Correct handling of wide chars on systems with sizeof(wchar_t)==2?
In-Reply-To: <20100210133028.9F5B65654F@rebar.astron.com>
References: <20100210090838.GA28659@calimero.vinschen.de>
 <20100210133028.9F5B65654F@rebar.astron.com>
Message-ID: <20100212220848.GG5683@calimero.vinschen.de>

On Feb 10 08:30, Christos Zoulas wrote:
> On Feb 10, 10:08am, Corinna Vinschen wrote:
> -- Subject: Re: Correct handling of wide chars on systems with sizeof(wchar_t
>
> | Come to think of it, I'm wondering if it's really necessary to
> | special-case UTF16_STRINGS in rt_mbtowc.
> | [...]
> | Therefore I'd like to propose removing
> | the #ifdef's, since every #ifdef is just a source of conditional
> | confusion.
>
> Committed, thanks!

Here's another addition. I think this is all that is needed to handle
UTF-16 surrogates. Time will tell :-}


Corinna


    * tc.nls.c (xwcwidth): New function for UTF-16 systems. Just define
    as wcwidth on UTF-32 systems.
    (NLSWidth): Call xwcwidth here.
    (NLSStringWidth): Ditto.
    * tc.str.c (one_wctomb): Handle characters outside the base plane
    manually on UTF-16 systems.
    (rt_mbtowc): Call mbrtowc instead of mbtowc to have the state
    information for surrogate handling. On UTF-16 systems, convert
    surrogate pairs to UTF-32 values.

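As a standalone illustration of the surrogate handling that the ChangeLog describes, the sketch below isolates the arithmetic that maps a code point beyond the base plane to a UTF-16 surrogate pair and back. The names to_surrogates and from_surrogates and the fixed-width integer types are assumptions made for this example only; tcsh itself does this inline on Char and wchar_t values.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not part of tcsh: split a code point in the range
   0x10000..0x10FFFF into a UTF-16 high/low surrogate pair. */
static void
to_surrogates(uint32_t cp, uint16_t *hi, uint16_t *lo)
{
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    cp -= 0x10000;                              /* 20 significant bits remain */
    *hi = (uint16_t)(0xd800 | (cp >> 10));      /* upper 10 bits */
    *lo = (uint16_t)(0xdc00 | (cp & 0x3ff));    /* lower 10 bits */
}

/* Hypothetical inverse: combine a surrogate pair into a UTF-32 value. */
static uint32_t
from_surrogates(uint16_t hi, uint16_t lo)
{
    assert(hi >= 0xd800 && hi <= 0xdbff);
    assert(lo >= 0xdc00 && lo <= 0xdfff);
    return ((((uint32_t)hi & 0x3ff) << 10) | ((uint32_t)lo & 0x3ff)) + 0x10000;
}
```

For example, U+1F600 splits into 0xD83D 0xDE00, and recombining that pair gives 0x1F600 again.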
Index: tc.nls.c =================================================================== RCS file: /p/tcsh/cvsroot/tcsh/tc.nls.c,v retrieving revision 3.22 diff -u -p -r3.22 tc.nls.c --- tc.nls.c 9 Feb 2010 20:20:59 -0000 3.22 +++ tc.nls.c 12 Feb 2010 22:03:28 -0000 @@ -34,7 +34,31 @@ RCSID("$tcsh: tc.nls.c,v 3.22 2010/02/09 20:20:59 christos Exp $") + #ifdef WIDE_STRINGS +# ifdef HAVE_WCWIDTH +# ifdef UTF16_STRINGS +int +xwcwidth (wint_t wchar) +{ + wchar_t ws[2]; + + if (wchar <= 0xffff) + return wcwidth ((wchar_t) wchar); + /* UTF-16 systems can't handle these values directly in calls to wcwidth. + However, they can handle them as surrogate pairs in calls to wcswidth. + What we do here is to convert UTF-32 values >= 0x10000 into surrogate + pairs and compute the width by calling wcswidth. */ + wchar -= 0x10000; + ws[0] = 0xd800 | (wchar >> 10); + ws[1] = 0xdc00 | (wchar & 0x3ff); + return wcswidth (ws, 2); +} +# else +#define xwcwidth wcwidth +# endif /* !UTF16_STRINGS */ +# endif /* HAVE_WCWIDTH */ + int NLSWidth(Char c) { @@ -42,7 +66,7 @@ NLSWidth(Char c) int l; if (c & INVALID_BYTE) return 1; - l = wcwidth((wchar_t) c); + l = xwcwidth((wchar_t) c); return l >= 0 ? l : 0; # else return iswprint(c) != 0; @@ -58,7 +82,7 @@ NLSStringWidth(const Char *s) while (*s) { c = *s++; #ifdef HAVE_WCWIDTH - if ((l = wcwidth((wchar_t) c)) < 0) + if ((l = xwcwidth((wchar_t) c)) < 0) l = 2; #else l = iswprint(c) != 0; Index: tc.str.c =================================================================== RCS file: /p/tcsh/cvsroot/tcsh/tc.str.c,v retrieving revision 3.32 diff -u -p -r3.32 tc.str.c --- tc.str.c 10 Feb 2010 13:29:57 -0000 3.32 +++ tc.str.c 12 Feb 2010 22:03:28 -0000 @@ -69,6 +69,19 @@ one_wctomb(char *s, Char wchar) s[0] = wchar & 0xFF; len = 1; } else { +#ifdef UTF16_STRINGS + if (wchar >= 0x10000) { + /* UTF-16 systems can't handle these values directly. Since the + rest of the code assumes UTF-32, we handle this here, + encapsulated in one_wctomb and rt_mbtowc. 
See there for + the inverse operation. */ + *s++ = 0xf0 | ((wchar & 0x1c0000) >> 18); + *s++ = 0x80 | ((wchar & 0x3f000) >> 12); + *s++ = 0x80 | ((wchar & 0xfc0) >> 6); + *s = 0x80 | (wchar & 0x3f); + return 4; + } +#endif len = wctomb(s, (wchar_t) wchar); if (len == -1) s[0] = wchar; @@ -84,10 +97,25 @@ rt_mbtowc(Char *pwc, const char *s, size int ret; char back[MB_LEN_MAX]; wchar_t tmp; + mbstate_t mb; - ret = mbtowc(&tmp, s, n); + memset (&mb, 0, sizeof mb); + ret = mbrtowc(&tmp, s, n, &mb); if (ret > 0) { *pwc = tmp; +#ifdef UTF16_STRINGS + if (tmp >= 0xd800 && tmp <= 0xdbff) { + /* UTF-16 surrogate pair. Fetch second half and compute + UTF-32 value. Dispense with the inverse test in this case. */ + size_t n2 = mbrtowc(&tmp, s + ret, n - ret, &mb); + if (n2 == 0 || n2 == (size_t)-1 || n2 == (size_t)-2) + ret = -1; + else { + *pwc = (((*pwc & 0x3ff) << 10) | (tmp & 0x3ff)) + 0x10000; + ret += n2; + } + } else +#endif if (wctomb(back, *pwc) != ret || memcmp(s, back, ret) != 0) ret = -1; } -- Corinna Vinschen Cygwin Project Co-Leader Red Hat From christos at zoulas.com Sat Feb 13 00:27:59 2010 From: christos at zoulas.com (Christos Zoulas) Date: Fri, 12 Feb 2010 17:27:59 -0500 Subject: Correct handling of wide chars on systems with sizeof(wchar_t)==2? In-Reply-To: <20100212220848.GG5683@calimero.vinschen.de> from Corinna Vinschen (Feb 12, 11:08pm) Message-ID: <20100212222759.19A3F56550@rebar.astron.com> On Feb 12, 11:08pm, vinschen at redhat.com (Corinna Vinschen) wrote: -- Subject: Re: Correct handling of wide chars on systems with sizeof(wchar_t | On Feb 10 08:30, Christos Zoulas wrote: | > On Feb 10, 10:08am, Corinna Vinschen wrote: | > -- Subject: Re: Correct handling of wide chars on systems with sizeof(wchar_t | > | > | Come to think of it, I'm wondering if it's really necessary to | > | special-case UTF16_STRINGS in rt_mbtowc. | > | [...] 
| > | Therefore I'd like to propose removing
| > | the #ifdef's, since every #ifdef is just a source of conditional
| > | confusion.
| >
| > Committed, thanks!
|
| Here's another addition. I think this is all that is needed to handle
| UTF-16 surrogates. Time will tell :-}
|
|
| Corinna
|

Committed, thanks!

christos

From vinschen at redhat.com Sat Feb 13 14:22:59 2010
From: vinschen at redhat.com (Corinna Vinschen)
Date: Sat, 13 Feb 2010 13:22:59 +0100
Subject: Correct handling of wide chars on systems with sizeof(wchar_t)==2?
In-Reply-To: <20100212222759.19A3F56550@rebar.astron.com>
References: <20100212220848.GG5683@calimero.vinschen.de>
 <20100212222759.19A3F56550@rebar.astron.com>
Message-ID: <20100213122259.GL5683@calimero.vinschen.de>

On Feb 12 17:27, Christos Zoulas wrote:
> On Feb 12, 11:08pm, Corinna Vinschen wrote:
> | Here's another addition. I think this is all that is needed to handle
> | UTF-16 surrogates. Time will tell :-}
>
> Committed, thanks!

Sorry, but the patch was not well thought out. The conversion of
Unicode values >= 0x10000 to the multibyte representation only makes
sense if the current codeset is UTF-8, of course. Unfortunately there
are other multibyte representations containing values from the Unicode
area beyond the base plane (GB18030). Therefore, I'd like to suggest
the following fix:


Index: tc.str.c
===================================================================
RCS file: /p/tcsh/cvsroot/tcsh/tc.str.c,v
retrieving revision 3.33
diff -u -p -r3.33 tc.str.c
--- tc.str.c 12 Feb 2010 22:18:20 -0000 3.33
+++ tc.str.c 13 Feb 2010 12:21:43 -0000
@@ -71,16 +71,19 @@ one_wctomb(char *s, Char wchar)
     } else {
 #ifdef UTF16_STRINGS
         if (wchar >= 0x10000) {
-            /* UTF-16 systems can't handle these values directly. Since the
-               rest of the code assumes UTF-32, we handle this here,
-               encapsulated in one_wctomb and rt_mbtowc. See there for
-               the inverse operation. */
-            *s++ = 0xf0 | ((wchar & 0x1c0000) >> 18);
-            *s++ = 0x80 | ((wchar & 0x3f000) >> 12);
-            *s++ = 0x80 | ((wchar & 0xfc0) >> 6);
-            *s = 0x80 | (wchar & 0x3f);
-            return 4;
-        }
+            /* UTF-16 systems can't handle these values directly in calls to
+               wctomb. Convert value to UTF-16 surrogate and call wcstombs to
+               convert the "string" to the correct multibyte representation,
+               if any. */
+            wchar_t ws[3];
+            wchar -= 0x10000;
+            ws[0] = 0xd800 | (wchar >> 10);
+            ws[1] = 0xdc00 | (wchar & 0x3ff);
+            ws[2] = 0;
+            /* The return value of wcstombs excludes the trailing 0, so len is
+               the correct number of multibytes for the Unicode char. */
+            len = wcstombs (s, ws, MB_CUR_MAX + 1);
+        } else
 #endif
         len = wctomb(s, (wchar_t) wchar);
         if (len == -1)


Corinna

-- 
Corinna Vinschen
Cygwin Project Co-Leader
Red Hat

From christos at zoulas.com Mon Feb 15 02:46:44 2010
From: christos at zoulas.com (Christos Zoulas)
Date: Sun, 14 Feb 2010 19:46:44 -0500
Subject: Correct handling of wide chars on systems with sizeof(wchar_t)==2?
In-Reply-To: <20100213122259.GL5683@calimero.vinschen.de> from Corinna
 Vinschen (Feb 13, 1:22pm)
Message-ID: <20100215004644.EBC3456554@rebar.astron.com>

On Feb 13, 1:22pm, vinschen at redhat.com (Corinna Vinschen) wrote:
-- Subject: Re: Correct handling of wide chars on systems with sizeof(wchar_t

| Sorry, but the patch was not well thought out. The conversion of
| Unicode values >= 0x10000 to the multibyte representation only makes
| sense if the current codeset is UTF-8, of course. Unfortunately there
| are other multibyte representations containing values from the Unicode
| area beyond the base plane (GB18030). Therefore, I'd like to suggest
| the following fix:

Committed that one too.

christos

From vinschen at redhat.com Mon Feb 15 11:37:52 2010
From: vinschen at redhat.com (Corinna Vinschen)
Date: Mon, 15 Feb 2010 10:37:52 +0100
Subject: Correct handling of wide chars on systems with sizeof(wchar_t)==2?
In-Reply-To: <20100215004644.EBC3456554@rebar.astron.com>
References: <20100213122259.GL5683@calimero.vinschen.de>
 <20100215004644.EBC3456554@rebar.astron.com>
Message-ID: <20100215093752.GX5683@calimero.vinschen.de>

On Feb 14 19:46, Christos Zoulas wrote:
> On Feb 13, 1:22pm, vinschen at redhat.com (Corinna Vinschen) wrote:
> -- Subject: Re: Correct handling of wide chars on systems with sizeof(wchar_t
>
> | Sorry, but the patch was not well thought out. The conversion of
> | Unicode values >= 0x10000 to the multibyte representation only makes
> | sense if the current codeset is UTF-8, of course. Unfortunately there
> | are other multibyte representations containing values from the Unicode
> | area beyond the base plane (GB18030). Therefore, I'd like to suggest
> | the following fix:
>
> Committed that one too.

Thanks! I hope I caught it all now.


Corinna

-- 
Corinna Vinschen
Cygwin Project Co-Leader
Red Hat
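
To make the point of the last fix concrete: the byte arithmetic that the earlier revision hard-coded in one_wctomb is the four-byte UTF-8 encoding, so the bytes it produces are only right when the current codeset is UTF-8 and would not match what a codeset such as GB18030 expects. The sketch below isolates that arithmetic. The name utf8_encode_supplementary is a hypothetical one chosen for this illustration, not a tcsh function, and the fixed-width types are likewise an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not part of tcsh: encode a code point in the range
   0x10000..0x10FFFF as four UTF-8 bytes. Only meaningful when the current
   codeset really is UTF-8. Returns the number of bytes written. */
static int
utf8_encode_supplementary(uint32_t cp, unsigned char out[4])
{
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    out[0] = (unsigned char)(0xf0 | (cp >> 18));          /* top 3 bits */
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3f)); /* next 6 bits */
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3f));  /* next 6 bits */
    out[3] = (unsigned char)(0x80 | (cp & 0x3f));         /* low 6 bits */
    return 4;
}
```

Going through a surrogate pair and wcstombs, as the committed fix does, leaves the choice of byte sequence to the C library and therefore to the active locale.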