Reversing a string

**Dino** · 10-30-2012

Originally Posted by nonoob

No it ain't trivial. Requires C++ to start with.

Wow.

fscanf() issue...

**christop** · 10-30-2012

Taking Magi's strrev function and changing it to reverse a wide-character string:

Code:

void wcsrev(wchar_t string1[], wchar_t string2[])
{
     size_t i=0;
     size_t j = wcslen(string2) - 1;
     
     while (string2[i] != '\0')
     {
           string1[i]=string2[j];
           i++;
           j--;
     }
     string1[i]='\0';
}

I only had to change the types of string1 and string2 and call wcslen instead of strlen. I also took the liberty of changing the function's name to wcsrev and changing int to size_t. Easy.

**Click_here** · 10-30-2012

Originally Posted by nonoob

... Requires C++ to start with

... Or the standard wide char utilities

The C99 Draft (N869, 18 January, 1999)

**nonoob** · 10-31-2012

I still don't agree that it's trivial. If wchar_t is used, then say goodbye to simple indexing a char array. Unless you expect indexing (pointer dereferencing) and assignment to be magically overloaded for you. Which means you need C++.

**Click_here** · 10-31-2012

Originally Posted by nonoob

I still don't agree that it's trivial. If wchar_t is used, then say goodbye to simple indexing a char array. Unless you expect indexing (pointer dereferencing) and assignment to be magically overloaded for you. Which means you need C++.

Could you please provide an example highlighting what you mean by this?

Thanks.

**phantomotap** · 10-31-2012

Wow.

O_o

I'm sorry, but was that post supposed to prove that this was trivial in the face of UNICODE?

I only ask because the code from the relevant post does not "works to reverse a string in a UTF-8 encoded file". It only works to reverse a string with UTF8 encoding that follows a very specific set of rules. The rules a such a string must follow allows barely more than ASCII with various LATIN extensions while requiring all sequences be fully canonically composed.

A string composed of code points that need three or four bytes for a code point will not be reversed; such a string will be mangled.

A string with a decomposed character such as "(A)(`)" will not be reversed regardless of multi-byte code points; such a string will be mangled. A string that follows canonical composition with characters outset the precomposed range will also be mangled.

It is easy for you to say "multi-byte characters" and "multi-code point sequences" not allowed, but the reality is that "Résumé" can be written as "R(e)(')sum(e)(')" which invalidates the "pretty easy" implementation.

That's not even a bad notion; saying call `CraftCanonicalComposition' before calling `ReverseString' isn't an issue, but even then you'd have to deal with sequences outside of the precomposed range or note that the implementation doesn't work with such strings.

I only had to change the types of string1 and string2 and call wcslen instead of strlen.

That is all you did; that is not all you'd have to do.

Your code simply doesn't work on systems where `wchar_t' is a character set with a decomposition like UNICODE.

*shrug*

Even if this stuff was fixed, this follows a very English notion of "reverse a string".

Face it, internationalization and localization are not trivial.

Your assertions otherwise are lies, and such assertions make you look rather ignorant.

Soma

**dmh2000** · 10-31-2012

That function doesn't care about the encoding of user input.

how about UTF-8? the function would have to know the encoding

**Nominal Animal** · 10-31-2012

It is possible to duplicate an arbitrary C string in reverse order, without messing up any combining characters. You'll need a C library with C99-style wide character support, and support for the "combining" wide character type. It works fine with libc6-2.15-0ubuntu10.3 and gcc-4.6.3-1ubuntu5 on x86-64, but I'm too lazy to find out exactly how portable it is to other systems.

Here is the reverser function itself:

Code:

#include <string.h>
#include <wchar.h>
#include <wctype.h>
#include <errno.h>

int strrev(char *const target, const char *const source)
{
    const size_t  length = (source) ? strlen(source) : 0;
    size_t        offset = 0;
    size_t        have, nextlen;
    wctype_t      combining;
    mbstate_t     state;
    wchar_t       next;

    /* Invalid parameters? */
    if (!target)
        return errno = EINVAL;

    /* NULL or empty source string? */
    if (length < 1) {
        *target = '\0';
        return 0;
    }

    /* Get the combining character type. */
    combining = wctype("combining");

    /* Clear shift state. */
    memset(&state, 0, sizeof state);

    /* Length of the initial wide character. */
    nextlen = mbrtowc(&next, source, length, &state);
    if (nextlen < 1 || nextlen > length)
        return errno = EILSEQ;

    /* Character loop. */
    while (offset < length) {

        /* Take the next character. */
        have = nextlen;

        /* Scan for the next character. */
        while (offset + have < length) {

            /* Convert next character (so we can find out its class) */
            nextlen = mbrtowc(&next, source + offset + have, length - offset - have, &state);
            if (nextlen < 1 || nextlen > length - offset - have)
                return errno = EILSEQ;

            /* Not a combining character? */
            if (!iswctype((wint_t)next, combining))
                break;

            /* Combining characters are grouped with the previous character. */
            have += nextlen;
        }

        /* Copy the wide character, including any combining characters that follow it. */
        memcpy(target + length - offset - have, source + offset, have);
        offset += have;
    }

    /* Terminate target. */
    target[length] = '\0';

    return 0;
}

It is 63 lines long, and decidedly more complicated than the basic wide character string reverser shown above. The logic is quite straightforward, though: we keep the length of the next wide character (in chars) in nextlen. At the start of each outer loop iteration, we start with that character, then scan the next character, but include any combining characters to the current one, so that the complete sequence is copied as is.

The main program must set at least the LC_CTYPE locale (setlocale(LC_CTYPE, "") suffices), so that the character type classification works.

To test this, I used

Code:

#include <stdlib.h>
#include <locale.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int arg;

    /* Use the LC_CTYPE locale as set by user running this. */
    setlocale(LC_CTYPE, "");

    /* Reverse each command-line parameter. */
    for (arg = 1; arg < argc; arg++) {
        const char *const s = argv[arg];
        const size_t      n = (s) ? strlen(s) : 0;
        char             *t;

        t = malloc(n + 1);
        if (!t) {
            fprintf(stderr, "Out of memory.\n");
            return 1;
        }

        if (strrev(t, s)) {
            fprintf(stderr, "%s: Cannot reverse string.\n", argv[arg]);
            return 1;
        }

        printf("Original: \"%s\"\n", s);
        printf("Reversed: \"%s\"\n", t);

        free(t);
    }

    return 0;
}

The string "foo x⃣ o⃝⃔ bar" reverses to "rab o⃝⃔ x⃣ oof".

Note that "x⃣" is actually U+0078 U+20e3 and that "o⃝⃔" is actually U+006f U+20dd U+20d4; i.e. these are combined characters.

I will not claim the above is trivial, but I think it is quite straightforward to do.

I know that most applications do not bother to consider combining characters. It is a pity, but understandable if you consider how much work one needs to do to do it right. It would be much, much simpler if we had a version of mbrlen() that returned the length of the glyph, i.e. would include any combining characters in the length (say, mbrglyphlen()). If there was anyone that felt this to be sufficiently important, it should not be impossible to convince the various C library developers to include that as an extension -- but it's a multi-year effort, and you need convincing, important use cases. In particular, the developers who only speak English tend to ignore these issues altogether. Out of sight, out of mind, I guess.

**phantomotap** · 10-31-2012

It is 63 lines long, and decidedly more complicated than the basic wide character string reverser shown above.

O_o

I didn't look at the code much, but the description looks accurate.

Now that you have a version that works with UNICODE, you need to also consider locales and how other languages consider "reverse this string" with respect to non-combining characters like "ch" in Slovak or other characters represented by multiple glyphs that aren't formed from combining characters.

I fear my point was lost because I focused on how the examples were incomplete. The point here wasn't that it can't be done. It can be done. It has been done. It does however require intimate knowledge of the locale as well as encoding. Any simple effort will be wrong for a given locale. I had not intended to offer a challenge. Please don't try to do this like the comment was a challenge. Seriously, we could do this forever; internationalization and localization is a huge and complicated mess. Even massive, well respected libraries like "International Components for Unicode" gets stuff wrong.

I know that most applications do not bother to consider combining characters.

I believe this is a case of "Wrong tool for the job.". It is hard to get even reasonable behavior out of the box with only what ISO C provides. If you want to invest in localization and internationalization more sophisticated libraries should be considered.

It is a pity, but understandable if you consider how much work one needs to do to do it right.

Indeed. This is one situation where "reinventing the wheel" is almost completely wasteful even if it is just for educational purposes.

[Edit]
To be clear, I usually consider "reinventing the wheel" to be a solid win when done for educational purposes.
[/Edit]

It would be much, much simpler if we had a version of mbrlen() that returned the length of the glyph, i.e. would include any combining characters in the length (say, mbrglyphlen()).

The character and code point based iterator patterns employed by a lot of libraries are significantly better for character and code point based operations like the one in question.

Code:

struct string * sSource(CreateString());
struct string * sDestination(CreateString());
LoadString(sSource, "Whatever");
struct character * sCharacter;
while(sCharacter = GetNextCharacter(sCharacter))
{
    Prepend(sDestination, sCharacter);
}
Finalize(sDestination);
Free(sCharacter);
Free(sSource);

The `GetNextCharacter', whatever it may be called, essentially performs the same operation as the loop in a context that is aware of the locale.

[Edit]
@Nominal Animal

The `setlocale(LC_CTYPE, "");' alone breaks portability. No, I am not joking. I wish I was, but I am not. Some vendors just don't implement "get the user's preferred locale" correctly and probably never will. Well, it isn't as if some systems make it easy to implement that.
[/Edit]

Soma

**Click_here** · 10-31-2012

Quickly looking at nonoob's comment that this task "requires" C++ -> Can we say that this was inaccurate?

**phantomotap** · 10-31-2012

Can we say that this was inaccurate?

Sure.

It is exactly as inaccurate as: "You'll need a C library with C99-style wide character support, and support for the "combining" wide character type.".

Clearly though, Nominal Animal isn't saying that it can't be done without C99 support.

In the same way, given the post in question and the post to which it responds, nonoob intends much the same.

Soma

**christop** · 10-31-2012

Originally Posted by phantomotap

I only ask because the code from the relevant post does not "works to reverse a string in a UTF-8 encoded file". It only works to reverse a string with UTF8 encoding that follows a very specific set of rules. The rules a such a string must follow allows barely more than ASCII with various LATIN extensions while requiring all sequences be fully canonically composed.

Originally Posted by dmh2000

how about UTF-8? the function would have to know the encoding

If the user/file input is UTF-8 or any other encoding, the caller must convert the character string into a wide-character string. The function doesn't have to deal with UTF-8 because the wide string is never going to be in UTF-8. Many systems do use the Unicode character set (in contrast to character encodings of the Unicode character set like UTF-8 or ISO-8859-1) for the wchar_t type, but regardless of the character set, wide characters are an internal representation only.

Now, the only valid complaints about my wide-string reversing function is that it does not handle combining characters and other stuff like zero-width joiner characters properly. But the complaints about UTF-8 or other character encodings are way off base. As I said, the caller must convert a string of characters from whichever encoding it's in to a string of wide characters using scanf, mbstowcs, or a number of other methods.

Originally Posted by nonoob

I still don't agree that it's trivial. If wchar_t is used, then say goodbye to simple indexing a char array. Unless you expect indexing (pointer dereferencing) and assignment to be magically overloaded for you. Which means you need C++.

What oh what did we ever do before function/method overloading? If wchar_t is used, you then have simple indexing of a wchar_t array. At least with a wchar_t array you have to deal only with combining characters and the like, but with a char array you have to deal with those issues as well as character encodings.

Lastly, while Nominal Animal's solution handles combining characters, it also performs three different tasks inside a single function: converting from a multi-byte character string to a wide character string, reverse the wide character string, and converting from a wide character string to a multi-byte character string. I'm sure someone can easily convert it to operate on an existing wide character string. It would probably be about half the length.

Originally Posted by Click_here

Quickly looking at nonoob's comment that this task "requires" C++ -> Can we say that this was inaccurate?

Yes. C++ just makes some tasks easier than C does.

**Nominal Animal** · 10-31-2012

Originally Posted by phantomotap

how other languages consider "reverse this string" with respect to non-combining characters like "ch" in Slovak or other characters represented by multiple glyphs that aren't formed from combining characters.

I agree with your point.

Actually, in some cases the multi-glyph groupings depend on the context, which cannot be derived from the locale alone. For example, consider ä, æ, and ae in European proper names. The rules on how to handle them in many cases depend on the origin country and language, not the country and language the user/application is using.

Furthermore, how do you "reverse" something like Hangul? Its composition and decomposition itself are difficult enough.

Originally Posted by phantomotap

Please don't try to do this like the comment was a challenge.

Oh no, I won't. I agree with you.

The entire concept of "reversing" a string depends on the character set, locale, and context. The best you can do in the general case is to reverse the glyphs.

The naive ASCII-based one will break multi-byte characters, like UTF-8 strings containing non-ASCII characters.

A naive multibyte-character one will break combined characters, combining to the wrong printable character. This already needs support that some vendors neglect.

The code I showed does reverse glyphs correctly, but only if the locale is set and the C library and compiler support it, including the non-standard features the function depends on. (The "combining" wide character class is something I guessed or stumbled upon by trial and error, not something I saw documented anywhere.)

Any further "reversing" requires detailed information on the language and context used. Even then, I'm not at all sure there is any way you can reverse e.g. ideograms used by some languages.

My post was more about reversing glyphs, and that we are missing some features in multibyte-character support in C to do even that correctly. A function that would provide the length in bytes for the next glyph would help a lot, but is not specified by any of the standards. Consider my code an example workaround for that.

Even then, the function should be in a dynamically loaded, well maintained library, because Unicode itself is still evolving, and the combining rules may change (there may be at some point new combining characters). (I think it should be in the C library fo it to actually make a difference in practice.)

Originally Posted by phantomotap

The character and code point based iterator patterns employed by a lot of libraries are significantly better for character and code point based operations like the one in question.

That approach does not work with Unicode combining characters. Many glyphs (like the two special ones I showed above) cannot be normalized to a single character, so that approach will not help with the more common glyph-related issues.

A typical issue is that you cut the string in mid-glyph when copying/pasting. When you cut/copy a part of a string, where the last character in a selection is part of a combined glyph, you usually lose the combining marks, and only get the base character. You see this when advancing a selection using shift and cursor keys, for example. Or copy such a string into an editable text box, move to the start/end, and see what happens when you press delete/backspace. It's pretty weird to see glyphs change instead of vanish -- unless you know what is happening.

A function that would return, reliably, the number of bytes that define the next glyph, would suit C code much better, and allow the same coding pattern you showed, but also help with the glyph issues.

For example, when making a selection in a string, you could easily always make sure you select complete glyphs and not just characters. It would be even easier if you had functions that return the beginning or end of a glyph, when given a string buffer and a char (byte) offset into it (and optionally the multibyte shift state at the beginning of the string buffer).

Originally Posted by phantomotap

The `setlocale(LC_CTYPE, "");' alone breaks portability.

Yes, I know. That was one of the reasons I didn't bother to even cursorily check the portability. Even fewer implement the non-standard "combining" wide character class specification; I suspect you'll only find it in Linux and BSD variants, and those systems that use a GNU compiler and C library.

To fix the situation, you'd need real push, a multi-year project, with someone putting pressure to proprietary vendors too. Political, not just grass-roots.

**phantomotap** · 10-31-2012

The function doesn't have to deal with UTF-8 because the wide string is never going to be in UTF-8. Many systems do use the Unicode character set (in contrast to character encodings of the Unicode character set like UTF-8 or ISO-8859-1) for the wchar_t type, but regardless of the character set, wide characters are an internal representation only.

O_o

I'm afraid you've misread what the standard says about multi-byte characters, `wchat_t', and wide characters.

The standard leaves a lot to be desired by placing much into the "implementation defined" category.

A compiler is free to use UTF8 for wide strings if the target environments native character set and locale can be so represented. Several real world compilers do use UTF16 for wide strings. (The problems referenced for UTF8 applies to UTF16.)

As far as "never going to be in UTF8", the input encoding doesn't really matter if the vendor has supplied no transformation from the relevant encoding to the compiler's wide character set. The crucial bits of the standard only requires such transformation for the largest native supported extended character set.

In other words, the sequence L"Hello, World!" when converted by a confirming compiler supporting the "enUS" locale to a wide character sequence should never be a problem. However, if a compiler uses a 16 bit `wchar_t' there is simply no way to map the entire UNICODE character set to a single wide character representation so a multi-byte sequence must remain a multi-byte sequence.

We are ultimately back where we started because, as I've said, it just isn't that trivial.

Actually, in some cases the multi-glyph groupings depend on the context, which cannot be derived from the locale alone.

Truth. Or even when context plays no role, dialect can place further burdens on software beyond what locales already manage.

Furthermore, how do you "reverse" something like Hangul?

I didn't have anything specific in mind, but that sort of complexity is exactly what I was referring to when I said "a very English notion of "reverse a string"".

Even then, I'm not at all sure there is any way you can reverse e.g. ideograms used by some languages.

To think, some languages would wind up being the same if a simple notion of "reverse" was applied.

That approach does not work with Unicode combining characters.

[Edit]
I realize that we are passing each other by because we are using two words each to mean two different things.

*shrug*

Things get muddy.

I've tried to use "code point" only to reference a single "character" in the relevant character set such as 'A' being ASCII 41.

I've tried to use "glyph" only to reference a single drawn character as rendered completely in the given locale such as Ä being "A-Umlaut".

I've tried to use "character" only to reference a single character entirely within the given language such as the Slovak "ch" being a single character.

I've tried to use "combining character" exactly as UNICODE defines.

With those definitions in mind consider again the code or explanation.
[/Edit]

Actually, it does.

I've not had any specific library in mind, but as you can see from the code I've referenced an unspecified structure.

Conceptually, we can assume that `GetNextCharacter' fills a structure with information such as the span (in code points perhaps) of the next code point and when encountering a combining character continues to fill the structure with more code points treating the thing entirely as a single unit.

It doesn't help with the dialect or contextual stuff of course.

A function that would return, reliably, the number of bytes that define the next glyph, would suit C code much better, and allow the same coding pattern you showed, but also help with the glyph issues.

This operation would necessarily be a measure of the functionality upon which `GetNextCharacter' would depend.

[Edit]
I moved this bit about for clarity.
[/Edit]

It would be even easier if you had functions that return the beginning or end of a glyph, when given a string buffer and a char (byte) offset into it (and optionally the multibyte shift state at the beginning of the string buffer).

[Edit]
Removed.

Now using your definitions as I've interpreted them: this is exactly the mechanism `GetNextCharacter' provides by placing such information into a structure.
[/Edit]

To fix the situation, you'd need real push, a multi-year project, with someone putting pressure to proprietary vendors too.

This is why I advocate just moving to a decent library like "ICU".

We, as in several tens of thousands of developers, can't get Microsoft to properly implement C99.

I see proper localization and internationalization support as a complete impossibility.

Soma

**Nominal Animal** · 11-01-2012

Originally Posted by phantomotap

I realize that we are passing each other by because we are using two words each to mean two different things. [code point], [glyph], [character]

Yes, sorry. I think I'm in complete agreement with you, although I did certainly forget to differentiate between characters and code points. Technically:

"Code point" is the correct term for each individual Unicode atom.

"Grapheme cluster" is probably the best term to describe an unit made up of a base character and all combining characters applied to it. (Glyph can mean the same thing; however, all the parts are glyphs too, so there is a risk of confusion.)

"Ligature" is the probably the best term to describe language-dependent units like Slovak "ch" that use more than one "grapheme cluster" to describe a single unit. Thus, to describe a language-dependent letter/character/phoneme/graph, we could say "grapheme cluster or ligature". (If we limit to just Latin alphabets, then we could use "phoneme" without much confusion; it would be terrible with non-Latin alphabets, though.)

In other words, we don't even have an unambiguous set of terms to describe these things. No wonder the situation is so messy!

Originally Posted by phantomotap

Conceptually, we can assume that `GetNextCharacter' fills a structure with information such as the span (in code points perhaps) of the next code point and when encountering a combining character continues to fill the structure with more code points treating the thing entirely as a single unit.

That would be excellent, but I don't know of any library that actually does that. Unless I am mistaken, even the CharacterIterator in ICU does not do that.

However, ICU does have the character-boundary iterator, which does do exactly what I was referring to.

Even ICU does not have facilities for language-specific ligatures like fl that use multiple grapheme clusters; it only recognizes ligatures like ﬂ that are single grapheme clusters. Fortunately, you (the application programmer) rarely needs that, unless doing something silly like trying to "reverse" a string.

Originally Posted by phantomotap

This is why I advocate just moving to a decent library like "ICU".

Agreed and seconded. I would prefer getting this basic functionality into a standard library, but that's not going to happen. Some proprietary vendors just do not care.

In practical applications, it is important that the library is dynamic and regularly maintained, because the rules evolve over time. Hopefully, major versions of the library will be updated when the rules change (rather than providing just an update to the latest version of the library), so already existing applications will not need to be recompiled to update the rules. Updating just the library should be sufficient. In particular, International Components for Unicode seems to have a stable community behind it, so I'd personally trust it to provide the updates in the future too. I would only consider writing my own when working in an embedded environment where using such a library would be difficult or impossible.

Thread: Reversing a string

Thread Tools

Search Thread

Display

Similar Threads

reversing a string, again..

reversing a string

Reversing a string

Reversing a string

Reversing a String

Tags for this Thread