Thread: Segmentation fault

  1. #16
    Registered User
    Join Date
    Feb 2013
    Location
    Sweden
    Posts
    89
    Oh, another one. The questions never end, it seems…

    In my working example, the following is not good, is it?
    Code:
        for(int i=1; i<=2; i++)
            printf("%ls\n", ToWide(argv[i]));
        return EXIT_SUCCESS;
    My thought (why this is a bad idea) is that I should free the memory before quitting, but now there's no pointer to it…
    I guess this is better:
    Code:
        wchar_t *Wide;
        for(int i=1; i<=2; i++) {
            Wide=ToWide(argv[i]);
            printf("%ls\n", Wide);
        }
        free(Wide);
        return EXIT_SUCCESS;
    Or am I completely dizzy now?
    Last edited by guraknugen; 05-01-2015 at 02:11 PM.

  2. #17
    Ticked and off
    Join Date
    Oct 2011
    Location
    La-la land
    Posts
    1,728
    Quote Originally Posted by guraknugen View Post
    The program counts characters and looks for characters in strings.
    In your case, I think I would use fwide(stdin,1) if you read from standard input (and the same for any files you might read), just because it then avoids the extra mbstowcs() call. If you count the characters in command-line parameters or environment variable values, use mbstowcs().

    Counting the number of characters using wide characters is much easier, and to me, makes sense here. The standard output and standard error I'd keep in narrow mode.
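    A minimal sketch of the wide-mode idea, assuming the stream has not been used for narrow I/O yet. The helper name count_wide_chars() is made up for this illustration; fwide() and fgetwc() are the standard calls.

```c
#include <assert.h>
#include <stdio.h>
#include <wchar.h>

/* Count the wide characters in a stream, not counting newlines.
 * Returns (size_t)-1 if the stream cannot be put into wide mode. */
size_t count_wide_chars(FILE *in)
{
    size_t count = 0;
    wint_t wc;

    assert(in != NULL);

    /* fwide() returns a positive value if the stream is wide-oriented. */
    if (fwide(in, 1) <= 0)
        return (size_t)-1;

    /* fgetwc() decodes one multibyte character per call,
     * according to the current locale. */
    while ((wc = fgetwc(in)) != WEOF)
        if (wc != L'\n')
            count++;

    return count;
}
```

    You would call setlocale(LC_ALL, "") first and then count_wide_chars(stdin), so that no separate mbstowcs() pass over the input is needed.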

    Quote Originally Posted by guraknugen View Post
    Code:
    wchar_t *wide_string(const char *const s)
    What's the difference between the two ”const”?
    You need to read the type qualifiers from right to left, separated by asterisks (which you read as "a pointer to").

    Summary:
    • const char *s
      s is a pointer to const char.
      You can change where s points to, but you cannot change the characters stored there.
      For example, s++ is allowed, but s[2] = '\0' is not allowed.
    • char *const s
      s is a constant, a pointer to char.
      You cannot change where s points to, but you can change the characters stored there.
      For example, s++ is not allowed, but s[2] = '\0' is allowed.
    • const char *const s
      s is a constant, a pointer to const char.
      You cannot change where s points to, and you cannot change the characters stored there.
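    To make the summary concrete, here is a small sketch. The two function names are invented for the example; the commented-out lines are the ones a compiler should reject.

```c
#include <assert.h>
#include <stddef.h>

/* const char *s: we may move s, but not modify the characters. */
size_t count_char(const char *s, char c)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s; s++)          /* s++ is allowed */
        if (*s == c)
            n++;
    /* *s = 'x';  would not compile: the characters are const. */
    return n;
}

/* char *const s: we may modify the characters, but not move s. */
void truncate_at(char *const s, size_t i)
{
    assert(s != NULL);
    s[i] = '\0';             /* allowed */
    /* s++;  would not compile: s itself is const. */
}
```

    With char buf[] = "banana", count_char(buf, 'a') gives 3, and after truncate_at(buf, 3) the buffer holds "ban".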


    I use it principally to convey programmer intent. When you learn to read the qualifiers from right to left, the intent for each variable becomes immediately clear.
    In some cases it also helps compilers produce better code, but nowadays compilers are darn smart about detecting whether stuff is changed or not.


    Quote Originally Posted by guraknugen View Post
    Code:
    if (n==(size_t)-1)
    I've been thinking about this one a little. Seems like it's checking if n==-1 after converting -1 to the size_t type, is that right? And is it really necessary to do that conversion? Shouldn't the compiler do that automatically?
    Yes, to me it is, yes.

    You see, size_t is an unsigned integer type, and mbstowcs() is documented to return exactly (size_t)-1 in case of an invalid input sequence.

    I'm not absolutely certain if the integer type promotion rules work out to exactly the above in all cases, as the size of the size_t type varies compared to the size of the integer constant -1. But, I don't even care enough to find out.

    You see, when I see that if (n == (size_t)-1) bit, and I remember or check that n is of type size_t, the check tells me the programmer has thought about this enough to stick that cast there, and that it should be correct -- that is, compare the value of n to exactly (size_t)-1.

    So, even if the C standards would guarantee that if (n == -1) does the exact same thing, I'd still want to see that cast there, just as a little confirmation that the programmer knew exactly what value they're comparing to.

    For the exact same reason, I tend to use if (n == (ssize_t)-1) when n is of type ssize_t, which is defined to be a signed type itself.
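    As a small illustration of what the cast produces: converting -1 to an unsigned type wraps around to the largest value of that type, so (size_t)-1 is the same value as SIZE_MAX. The helper name below is made up for the example, and static_assert needs a C11 compiler.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Checked at compile time: the cast yields the largest size_t value. */
static_assert((size_t)-1 == SIZE_MAX, "(size_t)-1 must equal SIZE_MAX");

/* Returns nonzero if n is the error value that mbstowcs() returns. */
int is_conversion_error(size_t n)
{
    return n == (size_t)-1;
}
```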

    Quote Originally Posted by guraknugen View Post
    Could it be that it's a typo? ”if(c != n)” seems to make more sense, doesn't it?
    Abso-frigging-lutely it is a typo; good catch!

    Yes, it was definitely intended to be if (c != n). The purpose of the test is, of course, to verify that nothing changed between the two mbstowcs() calls.

    Now that I think about it, it would be even better to replace the +1 with +2 in that function. You see, there is the possibility some implementations return size-1 if they run out of buffer space. Using +2, i.e. one extra wide character for the buffer, would mean we'd catch that case too.

    See how useful it is to ask questions? I definitely do like it, because they help me catch my own goofs, and help make the code I write even better.

    Quote Originally Posted by guraknugen View Post
    In my working example, the following is not good, is it?
    In both cases you're wasting memory. (Technically, it is leaking memory, but since all memory is freed when the process quits, it's only an issue while the program runs.)
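    The simplest fix is one free() per allocation, inside the loop. This is only a sketch: to_wide() here is a stand-in I wrote for your ToWide() (I'm assuming it returns a malloc()ed buffer), and like the rest of this thread it relies on mbstowcs() accepting a NULL destination to report the length, which glibc supports.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

/* Stand-in for ToWide(): returns a newly allocated wide copy of s,
 * or NULL on failure. The caller must free() the result. */
wchar_t *to_wide(const char *s)
{
    size_t   len;
    wchar_t *w;

    assert(s != NULL);

    len = mbstowcs(NULL, s, 0);
    if (len == (size_t)-1)
        return NULL;

    w = malloc((len + 1) * sizeof *w);
    if (!w)
        return NULL;

    if (mbstowcs(w, s, len + 1) != len) {
        free(w);
        return NULL;
    }
    return w;
}

/* Print each string, freeing every buffer before the next is allocated. */
int print_wide(int count, char *strings[])
{
    for (int i = 0; i < count; i++) {
        wchar_t *wide = to_wide(strings[i]);
        if (!wide)
            return -1;
        printf("%ls\n", wide);
        free(wide);          /* nothing is left dangling after the loop */
    }
    return 0;
}
```

    In your main() that would be print_wide(2, argv + 1) after checking that argc is at least 3.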

    Given the discussion and notes above, here's an even better approach, using the same interface getline() uses:
    Code:
    #include <stdlib.h>
    #include <string.h>
    #include <stdio.h>
    #include <wchar.h>
    #include <locale.h>
    #include <errno.h>
    
    size_t widen(wchar_t **const dataptr, size_t *const sizeptr, const char *const narrow)
    {
        wchar_t *data;
        size_t   size, len, check;
    
        if (!dataptr || !sizeptr || !narrow) {
            errno = EINVAL;
            return (size_t)0;
        }
    
        if (*dataptr) {
            data = *dataptr;
            size = *sizeptr;
        } else {
            data = NULL;
            size = 0;
        }
    
        len = mbstowcs(NULL, narrow, 0);
        if (len == (size_t)-1) {
            errno = EILSEQ;
            return (size_t)0;
        }
    
        if (len + 2 > size) {
            size = (len | 127) + 129; /* Allocate at least +2; here, some extra. */
            data = realloc(data, size * sizeof *data); /* size counts wide characters, not bytes */
            if (!data) {
                /* Note: *dataptr still valid and exists. */
                errno = ENOMEM;
                return (size_t)0;
            }        
            *dataptr = data;
            *sizeptr = size;
        }
    
        check = mbstowcs(data, narrow, size);
        if (check != len) {
            errno = EBUSY; /* Something changed from under us! */
            return (size_t)0;
        }
    
        /* We may return 0, if narrow was empty string.
         * So, set errno in all cases. */
        errno = 0;
        return len;
    }
    
    int main(int argc, char *argv[])
    {
        wchar_t *wide = NULL;
        size_t   size = 0;
        size_t   len;
        int arg;
    
        setlocale(LC_ALL, "");
    
        for (arg = 1; arg < argc; arg++) {
            len = widen(&wide, &size, argv[arg]);
            if (errno) {
                fflush(stdout);
                fprintf(stderr, "%s: %s.\n", argv[arg], strerror(errno)); /* %m is a glibc extension */
                return EXIT_FAILURE;
            }
            printf("\"%s\" = %zu characters: L\"%ls\".\n", argv[arg], len, wide);
        }
    
        /* Since we are exiting, we don't need to free the wide buffer,
         * but if we wanted to, or did some other work so freeing the memory
         * for other stuff made sense, this is how to do it safely: */
        free(wide);
        wide = NULL;
        size = 0;
    
        return EXIT_SUCCESS;
    }
    On an Ubuntu installation, you can use the command
    Code:
    awk '(NF >= 2 && $1 !~ /#/) { print $2 }' /usr/share/i18n/SUPPORTED | sort | uniq
    to list all the character sets that are supported for locales. Although normally you should just install a language pack, you can just install a specific locale for testing. To find the locale names that use an interesting character set, say GB2312, use
    Code:
    awk '($2=="GB2312") { print $1 }' /usr/share/i18n/SUPPORTED
    Then, install one of those locales, minimally (no language packs or anything, just the locale for testing here) using
    Code:
    sudo sh -c 'locale-gen zh_SG ; update-locale'
    and test with the above program (I'm assuming you compiled it to example) using say
    Code:
    LANG=zh_SG.GB2312 LC_ALL=zh_SG.GB2312 ./example "`printf '你好世界\n' | iconv -t GB2312`"
    Since your terminal is in UTF-8, you'll see something like
    "��������" = 4 characters: L"��������".
    but you could amend the stanza to
    Code:
    LANG=zh_SG.GB2312 LC_ALL=zh_SG.GB2312 ./example "`printf '你好世界\n' | iconv -t GB2312`" | iconv -f GB2312
    but then the output is the obvious
    "你好世界" = 4 characters: L"你好世界"
    and it's not at all clear that it used GB2312 internally; you'd have to trust the command. Dropping the latter iconv lets you verify it's not sneakily using UTF-8 behind your back, because the output makes no sense (in UTF-8, that your terminal is using).

  3. #18
    Registered User
    Wow, that was really a lot of great and useful information. Thanks a lot!

    I searched yesterday for information about error handling and it seems like some people don't like the use of errno.h unless for system stuff. Since you obviously use it, what are your thoughts about it?

  4. #19
    Ticked and off
    Quote Originally Posted by guraknugen View Post
    I searched yesterday for information about error handling and it seems like some people don't like the use of errno.h unless for system stuff. Since you obviously use it, what are your thoughts about it?
    I use it a lot in utility functions, but only to pass error codes, and only using the known error codes. See man 3 errno.

    In Linux, you could also use
    Code:
    __thread int myerrno = 0;
    for your own code, but why bother?

    I don't see any reason to avoid its use, as long as you restrict yourself to the known error codes. For example, I did use EBUSY (Device or resource busy) above to indicate an unexpected change in the string or locale, and although it doesn't match exactly, it's a good enough fit for me.
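    Here is a rough sketch of that convention in a small utility function. parse_int() is a hypothetical helper, not anything from the standard library; it reports failure through errno using only the standard codes EINVAL and ERANGE, and clears errno on success because 0 is also a valid result.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Parse a decimal int. On failure, returns 0 with errno set to
 * EINVAL (not a number) or ERANGE (does not fit in an int).
 * On success, errno is set to 0. */
int parse_int(const char *s)
{
    char *end;
    long  val;

    assert(s != NULL);

    errno = 0;
    val = strtol(s, &end, 10);

    /* No digits at all, or trailing garbage after the number. */
    if (end == s || *end != '\0') {
        errno = EINVAL;
        return 0;
    }

    /* strtol() overflowed, or the value does not fit in an int. */
    if (errno == ERANGE || val < INT_MIN || val > INT_MAX) {
        errno = ERANGE;
        return 0;
    }

    errno = 0;
    return (int)val;
}
```

    The caller checks errno right after the call, the same way main() does after widen() above.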

  5. #20
    Registered User
    I did some tests with your code above, and guess what… I have a couple of (newbie) questions…

    It works, but I don't really understand the pointer-to-pointer thing in the widen function. In main, wide is declared as a pointer and then &wide is sent to the function. That makes sense to me too, kind of, but it also feels confusing. I didn't even know that there is an address to the address to ”wide”…
    I would appreciate some further explaining about this. Can this be done without the pointer-to-pointer arrangement?
    I searched around the web, but I only found simple examples that I already understand…

    Next, there is the ”size = (len | 127) + 129;” line.
    Let's say that len+2=10, so len=8.
    Then, size=(len | 127)+129=(8|127)+129=127+129=256. Why do we want that?
    Last edited by guraknugen; 05-02-2015 at 11:14 AM. Reason: Failed to do it right in the first place…

  6. #21
    Ticked and off
    Quote Originally Posted by guraknugen View Post
    I don't really understand the pointer-to-pointer thing in the widen function.
    Ah, that means you haven't used getline() to read lines without line length restrictions, have you?

    The idea is that the function is given pointers to both the buffer and the length dynamically allocated for the buffer, to be used for the conversion. If not long enough, the function will simply reallocate a bigger buffer.

    On the first call, you initialize them to NULL and zero. Reallocating a NULL pointer is safe; it is equivalent to a normal allocation.

    On subsequent calls, the buffer is reused since it is already allocated, but it will be reallocated if more room is needed.

    Remember: you are just supplying a pointer to the wide-string-pointer, and a pointer to the allocated-size (for that wide-string-pointer).

    The pointed-to values do need to be initialized for the first call, but NULL and 0 are perfectly acceptable, and cause the function to do the memory allocation itself.
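    Stripped of the conversion, the pattern looks roughly like this. ensure_capacity() is a name I made up for the sketch; the point is that the function gets the addresses of the caller's pointer and size, so it can update both when it reallocates.

```c
#include <assert.h>
#include <stdlib.h>

/* Make sure *bufptr has room for at least `needed` bytes.
 * Returns 0 on success, or -1 if out of memory, in which case
 * the old buffer is untouched and still valid. */
int ensure_capacity(char **bufptr, size_t *sizeptr, size_t needed)
{
    assert(bufptr != NULL && sizeptr != NULL);

    if (*bufptr == NULL)
        *sizeptr = 0;        /* nothing allocated yet */

    if (needed > *sizeptr) {
        /* realloc(NULL, n) behaves like malloc(n). */
        char *tmp = realloc(*bufptr, needed);
        if (!tmp)
            return -1;
        *bufptr  = tmp;      /* update the caller's pointer ... */
        *sizeptr = needed;   /* ... and the caller's size */
    }
    return 0;
}
```

    The caller starts with char *buf = NULL; size_t size = 0; and passes &buf and &size on every call. If the function took a plain char * instead, it would only change its own copy of the pointer, and the caller would never see the new buffer.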

    Quote Originally Posted by guraknugen View Post
    Can this be done without the pointer-to-pointer arrangement?
    Well, you could use a structure to hide the details.. but that'd be pretty weak C-fu, in my opinion.

    Now, I do understand it looks weird and scary at first, but don't let that stop you. Kids don't let weird and scary stop them, so why would you, a big and strong programmer?

    It is an extremely useful, versatile pattern you should grasp without fear.

    I do also realize that many online examples that utilise this pattern are wrong.

    The crasseux.com C tutorial is a good example of this. Don't do what it does; it's just confusing as heck. There is no need to malloc() anything: just initialize the pointer to NULL and the size to zero, and let the initial call do the initial allocation as it sees fit.

    Quote Originally Posted by guraknugen View Post
    size = (len | 127) + 129;
    You could safely use just size = len + 2; there.

    The pattern I am utilizing here is rounding to the next multiple of a power of two:
    newvalue = (oldvalue OR (2^N - 1)) + 2^N + 1
    where N is any positive integer.

    It is one of the idioms that are characteristic of me. I just like it, because it compiles to extremely efficient code (you could say that the calculation is just about free), and the result is a multiple of a power of two (here, a multiple of 128) and grows by at least 2^N. (Here, we just need N >= 1, since we need size to be at least len + 2.)

    Using something like len + 2 or len + 100 would work just as well.

    While len + 2 is sufficient, it means we only allocate the minimum amount we need. That means that for some input, say input where each successive string is one character longer, we end up reallocating every single time.

    So, it is a good policy to allocate a bit more, perhaps to the next multiple of something, so that we hopefully don't need to make so many reallocations, overall.
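    To see what the expression does with actual numbers, here it is as a tiny function. The name oversize() is only for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Set the low 7 bits of len, then add 129. The result is always a
 * multiple of 128 and is at least len + 129. */
size_t oversize(size_t len)
{
    assert(len <= SIZE_MAX - 256);   /* the addition must not wrap around */
    return (len | 127) + 129;
}
```

    For len = 8 this gives (8 | 127) + 129 = 127 + 129 = 256, as you calculated. Every len from 0 to 127 also gives 256, and len = 128 gives 384, so strings that grow by one character at a time trigger a reallocation only once per 128 characters.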

    Furthermore:

    If we oversize our buffer a bit, then in practice, we could try the initial conversion opportunistically:

    Code:
    #include <stdlib.h>
    #include <string.h>
    #include <stdio.h>
    #include <wchar.h>
    #include <locale.h>
    #include <errno.h>
     
    size_t widen(wchar_t **const dataptr, size_t *const sizeptr, const char *const narrow)
    {
        wchar_t *data;
        size_t   size, len, check;
     
        if (!dataptr || !sizeptr || !narrow) {
            errno = EINVAL;
            return (size_t)0;
        }
     
        if (*dataptr) {
            data = *dataptr;
            size = *sizeptr;
        } else {
            data = NULL;
            size = 0;
        }
    
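        if (size > 1) {
            /* Try doing the conversion in our currently allocated buffer. */
            len = mbstowcs(data, narrow, size);
            if (len == (size_t)-1) {
                errno = EILSEQ;
                return (size_t)0;
            }
            /* The buffer was long enough! */
            if (len + 1 < size) {
                errno = 0; /* Callers test errno, so clear it on success. */
                return len;
            }
        }
    
        len = mbstowcs(NULL, narrow, 0);
        if (len == (size_t)-1) {
            errno = EILSEQ;
            return (size_t)0;
        }
    
        /* From here on, same as the earlier version: grow the buffer
         * (size counts wide characters), then convert again. */
        if (len + 2 > size) {
            size = (len | 1023) + 1025;
            data = realloc(data, size * sizeof *data);
            if (!data) {
                /* Note: *dataptr still valid and exists. */
                errno = ENOMEM;
                return (size_t)0;
            }
            *dataptr = data;
            *sizeptr = size;
        }
    
        check = mbstowcs(data, narrow, size);
        if (check != len) {
            errno = EBUSY; /* Something changed from under us! */
            return (size_t)0;
        }
    
        errno = 0;
        return len;
    }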
    In the cases where our buffer is already large enough -- and that should be most of them, if we oversize the allocation; I'd use size = (len | 1023) + 1025; or size = len + 1024; for this -- the opportunistic version will only do one mbstowcs() call, leading to much faster execution!

    On the downside, if our buffer is not large enough, we end up doing three mbstowcs() calls, which is getting a bit ridiculous.

    I'd bet that for a huge majority of cases, the opportunistic version is much faster (assuming oversizing at reallocation!), and those other cases are so slow (large strings) that the slowdown is not that noticeable to humans and therefore does not matter much.

    Does this explain why the approach of using a pointer to the buffer pointer, and the pointer to the size allocated for the buffer, and oversizing it at reallocation, is a really good approach?

  7. #22
    Registered User
    Once again, thank you so much for doing all that work!

    Quote Originally Posted by Nominal Animal View Post
    Ah, that means you haven't used getline() to read lines without line length restrictions, have you?
    Got me… But now, when I looked a little at it, I realise that I probably should…
    Quote Originally Posted by Nominal Animal View Post
    Now, I do understand it looks weird and scary at first, but don't let that stop you. Kids don't let weird and scary stop them, so why would you, a big and strong programmer?
    Hehehe… I'm not that big and I wouldn't call myself a programmer… I started to learn C as early as 1986, but my programming experience is rather weak. I never had a job involving any kind of programming. Now and then it happens that I write some simple code when I feel like it, but the time in between can be years sometimes and then I have forgotten most of what I learned…
    Quote Originally Posted by Nominal Animal View Post
    You could safely use just size = len + 2; there.

    The pattern I am utilizing here is rounding to the next multiple of a power of two:
    newvalue = (oldvalue OR (2^N - 1)) + 2^N + 1
    where N is any positive integer.
    Ah, yes, I see what you mean.

    Quote Originally Posted by Nominal Animal View Post
    Does this explain why the approach of using a pointer to the buffer pointer, and the pointer to the size allocated for the buffer, and oversizing it at reallocation, is a really good approach?
    Yes, actually I think it does, but maybe it's a bit overkill in my case. Interesting, nevertheless.
