Segmentation fault

**guraknugen** · 05-01-2015

Oh, another one. The questions never end, it seems…

In my working example, the following is not good, is it?

Code:

    for(int i=1; i<=2; i++)
        printf("%ls\n", ToWide(argv[i]));
    return EXIT_SUCCESS;

My thought (why this is a bad idea) is that I should free the memory before quitting, but now there's no pointer to it…
I guess this is better:

Code:

    wchar_t *Wide;
    for(int i=1; i<=2; i++) {
        Wide=ToWide(argv[i]);
        printf("%ls\n", Wide);
    }
    free(Wide);
    return EXIT_SUCCESS;

Or am I completely dizzy now?

**Nominal Animal** · 05-01-2015

Originally Posted by guraknugen

The program counts characters and looks for characters in strings.

In your case, I think I would use fwide(stdin,1) if you read from standard input (and the same for any files you might read), just because it then avoids the extra mbstowcs() call. If you count the characters in command-line parameters or environment variable values, use mbstowcs().

Counting the number of characters using wide characters is much easier, and to me, makes sense here. The standard output and standard error I'd keep in narrow mode.

Originally Posted by guraknugen

Code:

wchar_t *wide_string(const char *const s)

What's the difference between the two ”const”?

You need to read the type qualifiers from right to left, separated by asterisks (which you read as "a pointer to").

Summary:

const char *s
s is a pointer to const char.
You can change where s points to, but you cannot change the characters stored there.
For example, s++ is allowed, but s[2] = '\0' is not allowed.

char *const s
s is a constant, a pointer to char.
You cannot change where s points to, but you can change the characters stored there.
For example, s++ is not allowed, but s[2] = '\0' is allowed.

const char *const s
s is a constant, a pointer to const char.
You cannot change where s points to, and you cannot change the characters stored there.
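
To make the summary concrete, here is a small sketch. The function names are made up for illustration, and the lines the compiler would reject are left as comments:

```c
#include <stddef.h>

/* const char *s: the pointer may move, the characters may not change. */
size_t count_chars(const char *s)
{
    size_t n = 0;

    while (*s) {
        s++;            /* allowed: s itself is not const */
        n++;
    }
    /* s[0] = '\0';        not allowed: the characters are const */
    return n;
}

/* char *const s: the characters may change, the pointer may not move. */
void blank_out(char *const s)
{
    size_t i;

    for (i = 0; s[i]; i++)
        s[i] = ' ';     /* allowed: the characters are not const */
    /* s++;                not allowed: s itself is const */
}

/* const char *const s: neither the pointer nor the characters may change. */
char first_char(const char *const s)
{
    /* Both  s++;  and  s[0] = 'x';  are rejected by the compiler. */
    return s[0];
}
```

Uncommenting any of the "not allowed" lines should give a compile-time error, which is exactly the point: the qualifiers let the compiler enforce the stated intent.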

I use it principally to convey programmer intent. When you learn to read the qualifiers from right to left, the intent for each variable becomes immediately clear.
In some cases it also helps compilers produce better code, but nowadays compilers are darn smart about detecting whether stuff is changed or not.

Originally Posted by guraknugen

Code:

if (n==(size_t)-1)

I've been thinking about this one a little. Seems like it's checking if n==-1 after converting -1 to the size_t type, is that right? And is it really necessary to do that conversion? Shouldn't the compiler do that automatically?

Yes, that's right. And to me, yes, the cast is necessary.

You see, size_t is an unsigned integer type, and mbstowcs() is documented to return exactly (size_t)-1 in case of an invalid input sequence.

I'm not absolutely certain if the integer type promotion rules work out to exactly the above in all cases, as the size of the size_t type varies compared to the size of the integer constant -1. But I don't even care enough to find out.

You see, when I see that if (n == (size_t)-1) bit, and I remember or check that n is of type size_t, the check tells me the programmer has thought about this enough to stick that cast there, and that it should be correct -- that is, compare the value of n to exactly (size_t)-1.

So, even if the C standards would guarantee that if (n == -1) does the exact same thing, I'd still want to see that cast there, just as a little confirmation that the programmer knew exactly what value they're comparing to.

For the exact same reason, I tend to use if (n == (ssize_t)-1) when n is of type ssize_t, which is defined to be a signed type itself.
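
A quick sketch of what the cast compares against: converting -1 to an unsigned type always yields that type's maximum value, so (size_t)-1 is the same value as SIZE_MAX. A bare n == -1 relies on the usual arithmetic conversions picking size_t as the common type, while the cast says so explicitly. The helper name wide_length() is hypothetical:

```c
#include <stdint.h>
#include <stdlib.h>

/* Stores the wide-character length of narrow in *length and returns 0,
 * or returns -1 if narrow is not a valid multibyte string in the
 * current locale. */
int wide_length(const char *const narrow, size_t *const length)
{
    const size_t n = mbstowcs(NULL, narrow, 0);

    if (n == (size_t)-1)    /* the documented error value, equal to SIZE_MAX */
        return -1;

    *length = n;
    return 0;
}
```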

Originally Posted by guraknugen

Could it be that it's a typo? ”if(c != n)” seems to make more sense, doesn't it?

Abso-frigging-lutely it is a typo; good catch!

Yes, it was definitely intended to be if (c != n). The purpose of the test is, of course, to verify that nothing changed between the two mbstowcs() calls.

Now that I think about it, it would be even better to replace the +1 with +2 in that function. You see, there is the possibility some implementations return size-1 if they run out of buffer space. Using +2, i.e. one extra wide character for the buffer, would mean we'd catch that case too.
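
The function under discussion is not repeated in this post, so the following is only a sketch of the two-call pattern with the c != n check and the +2 buffer described above. The name wide_string() comes from the quoted prototype; the body is an assumption, not the original code:

```c
#include <stdlib.h>
#include <wchar.h>

/* Returns a newly allocated wide string, or NULL on failure.
 * The caller must free() the result. */
wchar_t *wide_string(const char *const s)
{
    wchar_t *w;
    size_t   n, c;

    if (!s)
        return NULL;

    /* First call: only count the wide characters needed. */
    n = mbstowcs(NULL, s, 0);
    if (n == (size_t)-1)
        return NULL;

    /* Room for n characters, the terminator, and one spare. */
    w = malloc((n + 2) * sizeof *w);
    if (!w)
        return NULL;

    /* Second call: do the actual conversion. */
    c = mbstowcs(w, s, n + 2);
    if (c != n) {
        /* The string or the locale changed between the two calls. */
        free(w);
        return NULL;
    }

    w[n] = L'\0';
    return w;
}
```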

See how useful it is to ask questions? I definitely do like it, because they help me catch my own goofs, and help make the code I write even better.

Originally Posted by guraknugen

In my working example, the following is not good, is it?

In both cases you're wasting memory. (Technically, it is leaking memory, but since all memory is freed when the process quits, it's only an issue while the program runs.)
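
For comparison, the smallest fix to the loop in the question is to free each converted string inside the loop, before the pointer is overwritten. This is only a sketch: to_wide() below is a hypothetical stand-in for the ToWide() in the question, assumed to return a malloc()ed wide string or NULL, and print_wide() is a made-up wrapper around the loop:

```c
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical stand-in for ToWide(): malloc()ed wide copy, or NULL. */
static wchar_t *to_wide(const char *const s)
{
    const size_t n = mbstowcs(NULL, s, 0);
    wchar_t     *w;

    if (n == (size_t)-1)
        return NULL;

    w = malloc((n + 1) * sizeof *w);
    if (w)
        mbstowcs(w, s, n + 1);
    return w;
}

/* Prints each string as a wide string; returns how many were printed. */
int print_wide(const int count, char *const strings[])
{
    int printed = 0;
    int i;

    for (i = 0; i < count; i++) {
        wchar_t *const wide = to_wide(strings[i]);
        if (!wide)
            continue;
        printf("%ls\n", wide);
        free(wide);     /* freed while we still have the pointer */
        printed++;
    }
    return printed;
}
```

In main() this could be called as print_wide(argc - 1, argv + 1). It still allocates and frees once per string, which is what the buffer-reusing version avoids.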

Given the discussion and notes above, here's an even better version, using the same approach getline() uses:

Code:

#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <wchar.h>
#include <locale.h>
#include <errno.h>

size_t widen(wchar_t **const dataptr, size_t *const sizeptr, const char *const narrow)
{
    wchar_t *data;
    size_t   size, len, check;

    if (!dataptr || !sizeptr || !narrow) {
        errno = EINVAL;
        return (size_t)0;
    }

    if (*dataptr) {
        data = *dataptr;
        size = *sizeptr;
    } else {
        data = NULL;
        size = 0;
    }

    len = mbstowcs(NULL, narrow, 0);
    if (len == (size_t)-1) {
        errno = EILSEQ;
        return (size_t)0;
    }

    if (len + 2 > size) {
        size = (len | 127) + 129; /* Allocate at least +2; here, some extra. */
        data = realloc(data, size * sizeof *data); /* size counts wide characters, not bytes */
        if (!data) {
            /* Note: *dataptr still valid and exists. */
            errno = ENOMEM;
            return (size_t)0;
        }        
        *dataptr = data;
        *sizeptr = size;
    }

    check = mbstowcs(data, narrow, size);
    if (check != len) {
        errno = EBUSY; /* Something changed from under us! */
        return (size_t)0;
    }

    /* We may return 0, if narrow was empty string.
     * So, set errno in all cases. */
    errno = 0;
    return len;
}

int main(int argc, char *argv[])
{
    wchar_t *wide = NULL;
    size_t   size = 0;
    size_t   len;
    int arg;

    setlocale(LC_ALL, "");

    for (arg = 1; arg < argc; arg++) {
        len = widen(&wide, &size, argv[arg]);
        if (errno) {
            fflush(stdout);
            fprintf(stderr, "%s: %m.\n", argv[arg]);
            return EXIT_FAILURE;
        }
        printf("\"%s\" = %zu characters: L\"%ls\".\n", argv[arg], len, wide);
    }

    /* Since we are exiting, we don't need to free the wide buffer,
     * but if we wanted to, or did some other work so freeing the memory
     * for other stuff made sense, this is how to do it safely: */
    free(wide);
    wide = NULL;
    size = 0;

    return EXIT_SUCCESS;
}

On an Ubuntu installation, you can use the command

Code:

awk '(NF >= 2 && $1 !~ /#/) { print $2 }' /usr/share/i18n/SUPPORTED | sort | uniq

to list all the character sets that are supported for locales. Although normally you should just install a language pack, you can just install a specific locale for testing. To find the locale names that use an interesting character set, say GB2312, use

Code:

awk '($2=="GB2312") { print $1 }' /usr/share/i18n/SUPPORTED

Then, install one of those locales, minimally (no language packs or anything, just the locale for testing here) using

Code:

sudo sh -c 'locale-gen zh_SG ; update-locale'

and test with the above program (I'm assuming you compiled it to example) using say

Code:

LANG=zh_SG.GB2312 LC_ALL=zh_SG.GB2312 ./example "`printf '你好世界\n' | iconv -t GB2312`"

Since your terminal is in UTF-8, you'll see something like

"��" = 4 characters: L"��".

but you could amend the stanza to

Code:

LANG=zh_SG.GB2312 LC_ALL=zh_SG.GB2312 ./example "`printf '你好世界\n' | iconv -t GB2312`" | iconv -f GB2312

but then the output is the obvious

"你好世界" = 4 characters: L"你好世界"

and it's not at all clear that it used GB2312 internally; you'd have to trust the command. Dropping the latter iconv lets you verify it's not sneakily using UTF-8 behind your back, because the output makes no sense (in UTF-8, that your terminal is using).

**guraknugen** · 05-02-2015

Wow, that was really a lot of great and useful information. Thanks a lot!

I searched yesterday for information about error handling and it seems like some people don't like the use of errno.h unless for system stuff. Since you obviously use it, what are your thoughts about it?

**Nominal Animal** · 05-02-2015

Originally Posted by guraknugen

I searched yesterday for information about error handling and it seems like some people don't like the use of errno.h unless for system stuff. Since you obviously use it, what are your thoughts about it?

I use it a lot in utility functions, but only to pass error codes, and only using the known error codes. See man 3 errno.

In Linux, you could also use

Code:

__thread int myerrno = 0;

for your own code, but why bother?

I don't see any reason to avoid its use, as long as you restrict yourself to the known error codes. For example, I did use EBUSY (Device or resource busy) above to indicate an unexpected change in the string or locale, and although it doesn't match exactly, it's a good enough fit for me.
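
As a small sketch of that convention (parse_count() is made up for this example): the utility function reports failure only through standard errno codes, and sets errno to zero on success so the caller can test it even when the return value is a legitimate zero.

```c
#include <errno.h>
#include <stdlib.h>

/* Parses a non-negative decimal integer.
 * On failure, returns 0 with errno set to EINVAL or ERANGE.
 * On success, sets errno to 0. */
long parse_count(const char *const text)
{
    char *end;
    long  value;

    if (!text || !*text) {
        errno = EINVAL;
        return 0;
    }

    errno = 0;
    value = strtol(text, &end, 10);
    if (errno == ERANGE || value < 0) {
        errno = ERANGE;     /* overflow, or a negative count */
        return 0;
    }
    if (end == text || *end != '\0') {
        errno = EINVAL;     /* not a number, or trailing garbage */
        return 0;
    }

    errno = 0;
    return value;
}
```

The caller then checks errno right after the call, and can print the reason with strerror(errno).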

**guraknugen** · 05-02-2015

Originally Posted by Nominal Animal

Code:

#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <wchar.h>
#include <locale.h>
#include <errno.h>

size_t widen(wchar_t **const dataptr, size_t *const sizeptr, const char *const narrow)
{
    wchar_t *data;
    size_t   size, len, check;

    if (!dataptr || !sizeptr || !narrow) {
        errno = EINVAL;
        return (size_t)0;
    }

    if (*dataptr) {
        data = *dataptr;
        size = *sizeptr;
    } else {
        data = NULL;
        size = 0;
    }

    len = mbstowcs(NULL, narrow, 0);
    if (len == (size_t)-1) {
        errno = EILSEQ;
        return (size_t)0;
    }

    if (len + 2 > size) {
        size = (len | 127) + 129; /* Allocate at least +2; here, some extra. */
        data = realloc(data, size * sizeof *data); /* size counts wide characters, not bytes */
        if (!data) {
            /* Note: *dataptr still valid and exists. */
            errno = ENOMEM;
            return (size_t)0;
        }        
        *dataptr = data;
        *sizeptr = size;
    }

    check = mbstowcs(data, narrow, size);
    if (check != len) {
        errno = EBUSY; /* Something changed from under us! */
        return (size_t)0;
    }

    /* We may return 0, if narrow was empty string.
     * So, set errno in all cases. */
    errno = 0;
    return len;
}

int main(int argc, char *argv[])
{
    wchar_t *wide = NULL;
    size_t   size = 0;
    size_t   len;
    int arg;

    setlocale(LC_ALL, "");

    for (arg = 1; arg < argc; arg++) {
        len = widen(&wide, &size, argv[arg]);
        if (errno) {
            fflush(stdout);
            fprintf(stderr, "%s: %m.\n", argv[arg]);
            return EXIT_FAILURE;
        }
        printf("\"%s\" = %zu characters: L\"%ls\".\n", argv[arg], len, wide);
    }

    /* Since we are exiting, we don't need to free the wide buffer,
     * but if we wanted to, or did some other work so freeing the memory
     * for other stuff made sense, this is how to do it safely: */
    free(wide);
    wide = NULL;
    size = 0;

    return EXIT_SUCCESS;
}

I did some tests with your code above, and guess what… I have a couple of (newbie) questions…

It works, but I don't really understand the pointer-to-pointer thing in the widen function. In main, wide is declared as a pointer and then &wide is sent to the function. That makes sense to me too, kind of, but it also feels confusing. I didn't even know that there is an address to the address to ”wide”…
I would appreciate some further explaining about this. Can this be done without the pointer-to-pointer arrangement?
I searched around the web, but I only found simple examples that I already understand…

Next, there is the ”size = (len | 127) + 129;” line.
Let's say that len+2=10, so len=8.
Then, size=(len | 127)+129=(8|127)+129=127+129=256. Why do we want that?

**Nominal Animal** · 05-03-2015

Originally Posted by guraknugen

I don't really understand the pointer-to-pointer thing in the widen function.

Ah, that means you haven't used getline() to read lines without line length restrictions, have you?

The idea is that the function is given pointers to both the buffer and the length dynamically allocated for the buffer, to be used for the conversion. If not long enough, the function will simply reallocate a bigger buffer.

Before the first call, you initialize them to NULL and zero. Reallocating a NULL pointer is safe; it is equivalent to a normal allocation.

On subsequent calls, the buffer is reused since it is already allocated, but it will be reallocated if more room is needed.

Remember: you are just supplying a pointer to the wide-string-pointer, and a pointer to the allocated-size (for that wide-string-pointer).

The pointed-to values do need to be initialized for the first call, but NULL and 0 are perfectly acceptable, and cause the function to do the memory allocation itself.
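
A stripped-down sketch of the same calling convention, using plain char strings so that only the pointer handling is left. The function store_copy() is hypothetical; it is not part of the program above:

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Copies src into the buffer *dataptr, growing it when needed.
 * *dataptr and *sizeptr must be NULL and 0, or a buffer and size
 * left there by an earlier call.
 * Returns 0 on success, -1 with errno set on failure. */
int store_copy(char **const dataptr, size_t *const sizeptr, const char *const src)
{
    size_t need;

    if (!dataptr || !sizeptr || !src) {
        errno = EINVAL;
        return -1;
    }

    need = strlen(src) + 1;

    if (!*dataptr || *sizeptr < need) {
        /* realloc(NULL, n) behaves like malloc(n). */
        char *const grown = realloc(*dataptr, need);
        if (!grown) {
            errno = ENOMEM;     /* the old buffer is still valid */
            return -1;
        }
        /* Writing through the pointers updates the caller's variables. */
        *dataptr = grown;
        *sizeptr = need;
    }

    memcpy(*dataptr, src, need);
    return 0;
}
```

The caller starts with char *buf = NULL; and size_t size = 0; and passes &buf and &size on every call. The function can replace the caller's buf only because it was given the address of buf, not a copy of its value. A single free(buf) at the end releases the buffer.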

Originally Posted by guraknugen

Can this be done without the pointer-to-pointer arrangement?

Well, you could use a structure to hide the details.. but that'd be pretty weak C-fu, in my opinion.

Now, I do understand it looks weird and scary at first, but don't let that stop you. Kids don't let weird and scary stop them, so why would you, a big and strong programmer?

It is an extremely useful, versatile pattern you should grasp without fear.

I do also realize that many online examples that utilise this pattern are wrong.

The crasseux.com C tutorial is a good example of what not to do: it's just confusing as heck. There is no need to malloc() anything; just initialize the pointer to NULL and the size to zero, and let the initial call do the initial allocation as it sees fit.

Originally Posted by guraknugen

size = (len | 127) + 129;

You could safely use just size = len + 2; there.

The pattern I am utilizing here is rounding to the next multiple of a power of two:

newvalue = ( oldvalue OR ( 2^N - 1 ) ) + 2^N + 1

where N is any positive integer.

It is one of the idioms that are characteristic of me. I just like it, because it compiles to extremely efficient code (you could say that the calculation is just about free), and the result is a multiple of a power of two -- here, a multiple of 128 --, and grows by at least 2^N. (Here, we just need N >= 1, since we need size to be at least len + 2.)

Using something like len + 2 or len + 100 would work just as well.

While len + 2 is sufficient, it means we only allocate the minimum amount we need. That means that for some input, say input where each successive string is one character longer, we end up reallocating every single time.

So, it is a good policy to allocate a bit more, perhaps to the next multiple of something, so that we hopefully don't need to make so many reallocations, overall.
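
To see what the idiom does with actual numbers, here is a tiny sketch; oversize() is a name invented for the example, with N = 7 as in the code above:

```c
#include <stddef.h>

/* (len | 127) sets the seven lowest bits, giving one less than the next
 * multiple of 128. Adding 129 = 128 + 1 lands on a multiple of 128
 * that is at least len + 129. */
size_t oversize(const size_t len)
{
    return (len | 127) + 129;
}
```

With this, every length from 0 through 127 gets a 256-element buffer and every length from 128 through 255 gets 384, so a run of strings that each grow by one character triggers a reallocation only once per 128 characters.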

Furthermore:

If we oversize our buffer a bit, then in practice, we could try the initial conversion opportunistically:

Code:

#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <wchar.h>
#include <locale.h>
#include <errno.h>
 
size_t widen(wchar_t **const dataptr, size_t *const sizeptr, const char *const narrow)
{
    wchar_t *data;
    size_t   size, len, check;
 
    if (!dataptr || !sizeptr || !narrow) {
        errno = EINVAL;
        return (size_t)0;
    }
 
    if (*dataptr) {
        data = *dataptr;
        size = *sizeptr;
    } else {
        data = NULL;
        size = 0;
    }

    if (size > 1) {
        /* Try doing the conversion in our currently allocated buffer. */
        len = mbstowcs(data, narrow, size);
        if (len == (size_t)-1) {
            errno = EILSEQ;
            return (size_t)0;
        }
        /* The buffer was long enough! */
        if (len + 1 < size) {
            errno = 0; /* Callers such as main() test errno, so clear it. */
            return len;
        }

    len = mbstowcs(NULL, narrow, 0);
    if (len == (size_t)-1) {
        errno = EILSEQ;
        return (size_t)0;
    }

    /* ... the function continues as in the earlier version, from the reallocation onward ... */

In the cases where our buffer is already large enough -- and that should be most of them, if we oversize the allocation; I'd use size = (len | 1023) + 1025; or size = len + 1024; for this -- the opportunistic version will only do one mbstowcs() call, leading to much faster execution!

On the downside, if our buffer is not large enough, we end up doing three mbstowcs() calls, which is getting a bit ridiculous.

I'd bet that for a huge majority of cases, the opportunistic version is much faster (assuming oversizing at reallocation!), and those other cases are so slow (large strings) that the slowdown is not that noticeable to humans and therefore does not matter much.

Does this explain why the approach of using a pointer to the buffer pointer, and the pointer to the size allocated for the buffer, and oversizing it at reallocation, is a really good approach?
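
Put together, the whole opportunistic function could look like the sketch below. It is assembled from the fragments in this thread, with two assumptions of mine marked in the comments: the allocation is sized in wchar_t units rather than bytes, and errno is cleared on the early-return path so that callers testing errno (as main() above does) see success.

```c
#include <errno.h>
#include <stdlib.h>
#include <wchar.h>

size_t widen(wchar_t **const dataptr, size_t *const sizeptr, const char *const narrow)
{
    wchar_t *data;
    size_t   size, len, check;

    if (!dataptr || !sizeptr || !narrow) {
        errno = EINVAL;
        return 0;
    }

    data = *dataptr;
    size = data ? *sizeptr : 0;

    if (size > 1) {
        /* Try the conversion in the buffer we already have. */
        len = mbstowcs(data, narrow, size);
        if (len == (size_t)-1) {
            errno = EILSEQ;
            return 0;
        }
        if (len + 1 < size) {
            /* It fit, terminator included. */
            errno = 0;      /* assumption: callers test errno */
            return len;
        }
    }

    /* Too small, or no buffer yet: find the length needed. */
    len = mbstowcs(NULL, narrow, 0);
    if (len == (size_t)-1) {
        errno = EILSEQ;
        return 0;
    }

    if (len + 2 > size) {
        size = (len | 1023) + 1025;
        /* assumption: size counts wide characters, not bytes */
        data = realloc(data, size * sizeof *data);
        if (!data) {
            errno = ENOMEM;     /* *dataptr is still valid */
            return 0;
        }
        *dataptr = data;
        *sizeptr = size;
    }

    check = mbstowcs(data, narrow, size);
    if (check != len) {
        errno = EBUSY;      /* something changed from under us */
        return 0;
    }

    errno = 0;
    return len;
}
```

It is a drop-in replacement for the earlier widen(): the main() shown above works with it unchanged.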

**guraknugen** · 05-03-2015

Once again, thank you so much for doing all that work!

Originally Posted by Nominal Animal

Ah, that means you haven't used getline() to read lines without line length restrictions, have you?

Got me… But now, when I looked a little at it, I realise that I probably should…

Originally Posted by Nominal Animal

Now, I do understand it looks weird and scary at first, but don't let that stop you. Kids don't let weird and scary stop them, so why would you, a big and strong programmer?

Hehehe… I'm not that big and I wouldn't call myself a programmer… I started to learn C as early as 1986, but my programming experience is rather weak. I never had a job involving any kind of programming. Now and then it happens that I write some simple code when I feel like it, but the time in between can be years sometimes and by then I have forgotten most of what I learned…

Originally Posted by Nominal Animal

You could safely use just size = len + 2; there.

The pattern I am utilizing here is rounding to the next multiple of a power of two:

newvalue = ( oldvalue OR ( 2^N - 1 ) ) + 2^N + 1

where N is any positive integer.

Ah, yes, I see what you mean.

Originally Posted by Nominal Animal

Does this explain why the approach of using a pointer to the buffer pointer, and the pointer to the size allocated for the buffer, and oversizing it at reallocation, is a really good approach?

Yes, actually I think it does, though it may be a bit of overkill in my case. Interesting, nevertheless.
