Originally Posted by
guraknugen
The program counts characters and looks for characters in strings.
In your case, I think I would use fwide(stdin,1) if you read from standard input (and the same for any files you might read), just because it then avoids the extra mbstowcs() call. If you count the characters in command-line parameters or environment variable values, use mbstowcs().
Counting the number of characters using wide characters is much easier, and to me, makes sense here. The standard output and standard error I'd keep in narrow mode.
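To make the wide-stream suggestion concrete, here is a minimal sketch of counting characters on a stream. The function name count_wide_chars is just something I made up for illustration, and it assumes you have already called setlocale(LC_ALL, "") and have not done any narrow I/O on the stream yet:

```c
#include <stdio.h>
#include <wchar.h>

/* Count the wide characters remaining on a stream.
 * Hypothetical helper: assumes the locale has been set and that
 * the stream has not already been used for narrow (byte) I/O.
 * Returns (size_t)-1 if the stream cannot be wide-oriented or
 * if a read error (such as an invalid sequence) occurs. */
size_t count_wide_chars(FILE *in)
{
    size_t count = 0;

    /* fwide() returns a positive value if the stream is wide-oriented. */
    if (fwide(in, 1) <= 0)
        return (size_t)-1;

    while (fgetwc(in) != WEOF)
        count++;

    /* WEOF is returned both at end of input and on error. */
    if (ferror(in))
        return (size_t)-1;

    return count;
}
```

You would call it as count_wide_chars(stdin) right after the setlocale() call. Each multibyte sequence in the input then counts as one character, with no separate mbstowcs() step.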
Originally Posted by
guraknugen
Code:
wchar_t *wide_string(const char *const s)
What's the difference between the two ”const”?
You need to read the type qualifiers from right to left, separated by asterisks (which you read as "a pointer to").
Summary:
- const char *s
s is a pointer to const char.
You can change where s points to, but you cannot change the characters stored there.
For example, s++ is allowed, but s[2] = '\0' is not allowed.
- char *const s
s is a constant, a pointer to char.
You cannot change where s points to, but you can change the characters stored there.
For example, s++ is not allowed, but s[2] = '\0' is allowed.
- const char *const s
s is a constant, a pointer to const char.
You cannot change where s points to, and you cannot change the characters stored there.
I use it principally to convey programmer intent. When you learn to read the qualifiers from right to left, the intent for each variable becomes immediately clear.
In some cases it also helps compilers produce better code, but nowadays compilers are darn smart about detecting whether stuff is changed or not.
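As a small illustration of the three forms, here is a sketch with one function per form. The function names are made up for this example; the commented-out lines are the ones a compiler should reject:

```c
#include <stddef.h>

/* const char *s: the pointer may move, the characters are read-only. */
size_t count_char(const char *s, const char c)
{
    size_t n = 0;
    while (*s) {
        if (*s == c)
            n++;
        s++;            /* allowed: s itself is not const */
        /* *s = c;         would not compile: the target is const */
    }
    return n;
}

/* char *const s: the pointer is fixed, the characters may be changed. */
void replace_char(char *const s, const char from, const char to)
{
    size_t i;
    for (i = 0; s[i]; i++)
        if (s[i] == from)
            s[i] = to;  /* allowed: the target is not const */
    /* s++;                would not compile: s itself is const */
}

/* const char *const s: neither the pointer nor the characters may change. */
size_t length_of(const char *const s)
{
    size_t i = 0;
    while (s[i])
        i++;
    return i;
}
```

Note that the const on the parameter itself (the rightmost one) only affects the function body; the caller cannot tell the difference.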
Originally Posted by
guraknugen
I've been thinking about this one a little. Seems like it's checking if n==-1 after converting -1 to the size_t type, is that right? And is it really necessary to do that conversion? Shouldn't the compiler do that automatically?
Yes, that's right; to me, the cast is necessary; and yes, the compiler should do the conversion automatically.
You see, size_t is an unsigned integer type, and mbstowcs() is documented to return exactly (size_t)-1 in case of an invalid input sequence.
I'm not absolutely certain if the integer type promotion rules work out to exactly the above in all cases, as the size of the size_t type varies compared to the size of the integer constant -1. But I don't even care enough to find out.
You see, when I see that if (n == (size_t)-1) bit, and I remember or check that n is of type size_t, the check tells me the programmer has thought about this enough to stick that cast there, and that it should be correct -- that is, compare the value of n to exactly (size_t)-1.
So, even if the C standards would guarantee that if (n == -1) does the exact same thing, I'd still want to see that cast there, just as a little confirmation that the programmer knew exactly what value they're comparing to.
For the exact same reason, I tend to use if (n == (ssize_t)-1) when n is of type ssize_t, which is defined to be a signed type itself.
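Here is a minimal sketch of the explicit comparison in isolation. The helper names conversion_failed and count_chars are made up for this example. Converting -1 to an unsigned type wraps around to the largest value that type can hold, so (size_t)-1 is the same value as SIZE_MAX:

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* True if n is the error value mbstowcs() is documented to return.
 * (size_t)-1 is the largest value a size_t can hold, i.e. SIZE_MAX. */
int conversion_failed(const size_t n)
{
    return n == (size_t)-1;
}

/* Number of characters in a multibyte string in the current locale,
 * or (size_t)-1 if the string contains an invalid sequence.
 * Passing NULL as the destination makes mbstowcs() only count. */
size_t count_chars(const char *const s)
{
    return mbstowcs(NULL, s, 0);
}
```

With these, if (conversion_failed(count_chars(s))) reads the same way as the cast does: the intent to compare against exactly that one value is visible at the call site.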
Originally Posted by
guraknugen
Could it be that it's a typo? ”if(c != n)” seems to make more sense, doesn't it?
Abso-frigging-lutely it is a typo; good catch!
Yes, it was definitely intended to be if (c != n). The purpose of the test is, of course, to verify that nothing changed between the two mbstowcs() calls.
Now that I think about it, it would be even better to replace the +1 with +2 in that function. You see, there is the possibility some implementations return size-1 if they run out of buffer space. Using +2, i.e. one extra wide character for the buffer, would mean we'd catch that case too.
See how useful it is to ask questions? I definitely do like it, because they help me catch my own goofs, and help make the code I write even better.
Originally Posted by
guraknugen
In my working example, the following is not good, is it?
In both cases you're wasting memory. (Technically, it is leaking memory, but since all memory is freed when the process quits, it's only an issue while the program runs.)
Given the discussion and notes above, here's an even better version, which manages its buffer the same way getline() does:
Code:
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <wchar.h>
#include <locale.h>
#include <errno.h>

size_t widen(wchar_t **const dataptr, size_t *const sizeptr, const char *const narrow)
{
    wchar_t *data;
    size_t   size, len, check;

    if (!dataptr || !sizeptr || !narrow) {
        errno = EINVAL;
        return (size_t)0;
    }

    if (*dataptr) {
        data = *dataptr;
        size = *sizeptr;
    } else {
        data = NULL;
        size = 0;
    }

    /* First pass: only count the wide characters needed. */
    len = mbstowcs(NULL, narrow, 0);
    if (len == (size_t)-1) {
        errno = EILSEQ;
        return (size_t)0;
    }

    if (len + 2 > size) {
        size = (len | 127) + 129; /* Allocate at least +2; here, some extra. */
        /* size counts wide characters, so scale it to bytes. */
        data = realloc(data, size * sizeof (wchar_t));
        if (!data) {
            /* Note: *dataptr still valid and exists. */
            errno = ENOMEM;
            return (size_t)0;
        }
        *dataptr = data;
        *sizeptr = size;
    }

    /* Second pass: do the actual conversion. */
    check = mbstowcs(data, narrow, size);
    if (check != len) {
        errno = EBUSY; /* Something changed from under us! */
        return (size_t)0;
    }

    /* We may return 0, if narrow was empty string.
     * So, set errno in all cases. */
    errno = 0;
    return len;
}

int main(int argc, char *argv[])
{
    wchar_t *wide = NULL;
    size_t   size = 0;
    size_t   len;
    int      arg;

    setlocale(LC_ALL, "");

    for (arg = 1; arg < argc; arg++) {
        len = widen(&wide, &size, argv[arg]);
        if (errno) {
            fflush(stdout);
            /* %m is a glibc extension; it prints strerror(errno). */
            fprintf(stderr, "%s: %m.\n", argv[arg]);
            return EXIT_FAILURE;
        }
        printf("\"%s\" = %zu characters: L\"%ls\".\n", argv[arg], len, wide);
    }

    /* Since we are exiting, we don't need to free the wide buffer,
     * but if we wanted to, or did some other work so freeing the memory
     * for other stuff made sense, this is how to do it safely: */
    free(wide);
    wide = NULL;
    size = 0;

    return EXIT_SUCCESS;
}
On an Ubuntu installation, you can use the command
Code:
awk '(NF >= 2 && $1 !~ /#/) { print $2 }' /usr/share/i18n/SUPPORTED | sort | uniq
to list all the character sets that are supported for locales. Although normally you should just install a language pack, you can just install a specific locale for testing. To find the locale names that use an interesting character set, say GB2312, use
Code:
awk '($2=="GB2312") { print $1 }' /usr/share/i18n/SUPPORTED
Then, install one of those locales, minimally (no language packs or anything, just the locale for testing here) using
Code:
sudo sh -c 'locale-gen zh_SG ; update-locale'
and test with the above program (I'm assuming you compiled it to example) using say
Code:
LANG=zh_SG.GB2312 LC_ALL=zh_SG.GB2312 ./example "`printf '你好世界\n' | iconv -t GB2312`"
Since your terminal is in UTF-8, you'll see something like
"��������" = 4 characters: L"��������".
You could amend the command to
Code:
LANG=zh_SG.GB2312 LC_ALL=zh_SG.GB2312 ./example "`printf '你好世界\n' | iconv -t GB2312`" | iconv -f GB2312
Then the output is the obvious
"你好世界" = 4 characters: L"你好世界"
and it's not at all clear that it used GB2312 internally; you'd have to trust the command. Dropping the latter iconv lets you verify it's not sneakily using UTF-8 behind your back, because the output makes no sense in UTF-8, which is what your terminal is using.
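If you'd rather have the program itself tell you which character set the locale uses, one option is the POSIX nl_langinfo() function. This is only a sketch and not part of the program above; the helper name current_codeset is made up for illustration:

```c
#include <langinfo.h>
#include <locale.h>

/* Name of the character set used by the current locale.
 * Hypothetical helper; only meaningful after setlocale() has been
 * called, and the returned string must not be modified or freed. */
const char *current_codeset(void)
{
    return nl_langinfo(CODESET);
}
```

Printing current_codeset() to standard error right after the setlocale(LC_ALL, "") call should then report the character set of whichever locale you selected via LANG and LC_ALL.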