Thread: ASCII and EBCDIC

  1. #1
    Registered User
    Join Date
    Feb 2022
    Location
    Canada, PEI
    Posts
    103

    ASCII and EBCDIC

    I'd love to write a program that takes certain characters (a-z, A-Z, 0-9) and maps them to int values, but I'm unsure how to make that portable and correct.

    The problem came up while comparing ASCII with EBCDIC. ASCII places these characters (a-z, A-Z, 0-9) in the first 128 positions, while EBCDIC places them in positions 128-255.

    Now here's my problem: what type should I use for my character to make the program portable? Will the char type do?

    What is char , signed char , unsigned char , and character literals in C? | by mohamad wael | Analytics Vidhya | Medium
    The char type can be signed or it can be unsigned , this is implementation defined . The C standard defines , the minimum range that the char type can have , an implementation can define large ranges .
    If the char type is unsigned , then it can only contain non negative values , and its minimum range as defined by the C standard is between 0 , and 127 . If the char type is signed , then it can contain 0 , negative , and positive values , and its minimum range as defined by the C standard , is between -127 , and 127 .
    Beside the char type in C , there is also the unsigned char , and the signed char types . All three types are different , but they have the same size of 1 byte . The unsigned char type can only store nonnegative integer values , it has a minimum range between 0 and 127 , as defined by the C standard. The signed char type can store , negative , zero , and positive integer values . It has a minimum range between -127 and 127 , as defined by the C standard .
    Reading the above link, I would assume the implementation takes care of the size details, so I could use the code below and it would work (well, it would work once compiled for the implementation in question):
    Code:
    #include <stdio.h>
    #include <stdlib.h>
    
    int get_index(char c) {
      switch (c) {
      case 'a':
      case 'A':
        return 0;
      case 'b':
      case 'B':
        return 1;
        /*and continues for the remaining characters*/
      default:
        return -1;
      }
    }
    
    int main(int argc, char ** argv) {
      fprintf(stdout, "%d\n", get_index('A'));
      fprintf(stdout, "%d\n", get_index('b'));
      return EXIT_SUCCESS;
    }
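    As a side note on the ranges quoted above: the limits of the current implementation can be read from <limits.h>. The standard requires UCHAR_MAX to be at least 255 (so a plain char that happens to be unsigned also reaches at least 255), while signed char must cover at least -127 to 127. A minimal sketch; print_char_limits and char_is_signed are hypothetical helper names, not something from the linked article:

```c
#include <limits.h>
#include <stdio.h>

/* Nonzero if plain char is a signed type on this implementation. */
int char_is_signed(void)
{
    return CHAR_MIN < 0;
}

/* Print the character-type limits of the current implementation. */
void print_char_limits(void)
{
    printf("CHAR_BIT      = %d\n", CHAR_BIT);
    printf("char          = %ld .. %ld (%s)\n", (long)CHAR_MIN, (long)CHAR_MAX,
           char_is_signed() ? "signed" : "unsigned");
    printf("signed char   = %d .. %d\n", SCHAR_MIN, SCHAR_MAX);
    printf("unsigned char = 0 .. %lu\n", (unsigned long)UCHAR_MAX);
}
```

    Calling print_char_limits() from main shows what the compiler actually chose. A switch on character constants does not depend on any of these values.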

  2. #2
    C++ Witch laserlight's Avatar
    Join Date
    Oct 2003
    Location
    Singapore
    Posts
    28,413
    Your strategy of using char and character constants in a switch should be absolutely portable for any standard conforming C implementation since the C standard explicitly mandates that those characters that you're interested in are in the basic source and basic execution character sets.
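    One related guarantee is worth knowing: the standard requires the digits '0' through '9' to have consecutive values, so plain subtraction is safe for digits. No such guarantee exists for letters, which is exactly where EBCDIC differs from ASCII. A minimal sketch, with digit_index as a hypothetical name:

```c
/* Return 0..9 for the characters '0'..'9', or -1 for anything else.
   Relies only on the guarantee that the decimal digits are contiguous. */
int digit_index(int ch)
{
    if (ch >= '0' && ch <= '9')
        return ch - '0';
    return -1;
}
```

    The same trick with 'a' and 'z' would happen to work on ASCII but is not portable.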

    That said, is supporting EBCDIC just an exercise, or is it of some practical concern? There may be a few legacy systems that use it today, but writing new software that explicitly has it in mind seems a little strange in 2022.

  3. #3
    Registered User
    Join Date
    Dec 2017
    Posts
    1,628
    EBCDIC is apparently still used in IBM mainframes, although ASCII is "tolerated" in that you can edit ASCII files and the CPU has instructions for conversion (see EBCDIC - Wikipedia).

    I would use int since character literals are ints in C anyway, and functions like getchar return an int.
    Code:
    #include <stdio.h>
    #include <ctype.h>
     
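    /* Works in both ASCII and EBCDIC: EBCDIC letters are not contiguous,
       but A-I, J-R and S-Z each form an unbroken run. */
    int get_index(int ch)
    {
        ch = toupper(ch);
        if (ch >= 'A' && ch <= 'I') return ch - 'A';
        if (ch >= 'J' && ch <= 'R') return ch - 'J' + 9;
        if (ch >= 'S' && ch <= 'Z') return ch - 'S' + 18;
        return -1;
    }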
     
    int main()
    {
        int ch;
        while ((ch = getchar()) != EOF)
            printf("%d\n", get_index(ch));
        return 0;
    }
    A switch might actually be faster.
    Code:
    #include <ctype.h>

    int get_index(int ch)
    {
        switch (toupper(ch))
        {
        case 'A': return 0;
        case 'B': return 1;
        /* ... and so on through 'Z' */
        }
        return -1;
    }
    Last edited by john.c; 03-24-2022 at 09:59 PM.

  4. #4
    Registered User
    Join Date
    Feb 2022
    Location
    Canada, PEI
    Posts
    103
    I just used EBCDIC as an example of characters which differed from ASCII to help with my example. I'm trying to get into a C frame of mind when I sit down and code, basically I'm trying to understand the problem in a scope larger than the computer I'm sitting in front of.

  5. #5
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,659
    Code:
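    #include <string.h>

    int get_index(char c) {
        const char *table = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
        /* strchr also matches the terminating '\0', so reject it first */
        const char *p = c ? strchr(table, c) : NULL;
        if ( p ) {
            return (int)(p - table);
        } else {
            return -1;
        }
    }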
    Or if you're doing this a lot, perhaps a lookup table.
    Code:
    #include <limits.h>

    int get_index(char c) {
        static int init = 0;
        static int lut[UCHAR_MAX + 1];
        if ( !init ) {
            const char *table = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
            for(int i = 0 ; i <= UCHAR_MAX; i++ ) lut[i] = -1;
            for(int i = 0 ; table[i] ; i++ ) {
                /* index as unsigned char: plain char may be negative */
                lut[(unsigned char)table[i]] = i;
            }
            init = 1;
        }
        return lut[(unsigned char)c];
    }
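    The table idea extends naturally to the full set asked about in the first post (a-z, A-Z, 0-9). A minimal sketch under an assumed layout: letters map to 0..25 regardless of case and digits to 26..35. That layout and the name get_index_alnum are hypothetical choices for illustration, not something specified in the thread:

```c
#include <ctype.h>
#include <string.h>

/* Map a-z/A-Z to 0..25 and 0-9 to 26..35; return -1 for anything else.
   No assumption is made about the numeric values of the characters. */
int get_index_alnum(char c)
{
    static const char table[] = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
    /* toupper needs a value representable as unsigned char */
    int up = toupper((unsigned char)c);
    const char *p;

    if (up == 0)            /* strchr would match the terminator */
        return -1;
    p = strchr(table, up);
    return p ? (int)(p - table) : -1;
}
```

    Because the position comes from the table string rather than from arithmetic on character codes, the result is the same on ASCII and EBCDIC systems.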

  6. #6
    Registered User
    Join Date
    Feb 2019
    Posts
    1,078
    Well... working with single-byte charsets isn't a real portability problem. Nowadays we have a harder one: on Windows you can work with the WINDOWS-1252 charset (a modified ISO-8859-1), while on Linux and other UNIXes it is common to use UTF-8 (a multi-byte charset).

    Take the character † as an example. In WINDOWS-1252 it is 0x86, it doesn't exist in ISO-8859-1, and in Unicode it is U+2020, which UTF-8 encodes as 0xe2 0x80 0xa0 (3 bytes long).
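    To see where those three bytes come from, here is a minimal sketch of the UTF-8 encoding rule for a single code point. The function name utf8_encode is hypothetical; a real program would normally leave this to a library:

```c
#include <stddef.h>

/* Encode one Unicode code point as UTF-8 into out (room for 4 bytes).
   Returns the number of bytes written, or 0 if cp is a surrogate or
   lies beyond U+10FFFF. */
size_t utf8_encode(unsigned long cp, unsigned char out[4])
{
    if (cp < 0x80) {                       /* 1 byte: 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                    /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                      /* surrogates are not encodable */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* 4 bytes: 11110xxx then 3 x 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

    For U+2020 this takes the 3-byte branch and yields 0xe2 0x80 0xa0.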

    One solution is to convert between charsets using a library such as libiconv and ask it to ignore or transliterate characters that don't exist in the target charset. Another solution is to use wchar_t, instead of char, as the unit for characters, together with the functions declared in wchar.h. On Linux and most UNIXes wchar_t is 32 bits long and can hold any Unicode code point; on Windows it is 16 bits.

    Notice that C is a multi-byte-charset-capable language (as described in ISO 9899); you can, with no problems, declare something like this:

    Code:
    char s[] = "مرحبا بالعالم"; // "Hello, world" in Arabic
    But this is stored as UTF-8 (assuming the source file is saved as UTF-8 and the compiler keeps it that way). You can also do:
    Code:
    wchar_t s[] = L"مرحبا بالعالم"; // "Hello, world", using wchar_t
    First case (s with char):
    Code:
    $ objdump -s test.o
    test.o:     file format elf64-x86-64
    
    Contents of section .data:
     0000 d985d8b1 d8add8a8 d8a720d8 a8d8a7d9  .......... .....
     0010 84d8b9d8 a7d984d9 8500               ..........
    Second, with wchar_t:
    Code:
    $ objdump -s test.o
    test.o:     file format elf64-x86-64
    
    Contents of section .data:
     0000 45060000 31060000 2d060000 28060000  E...1...-...(...
     0010 27060000 20000000 28060000 27060000  '... ...(...'...
     0020 44060000 39060000 27060000 44060000  D...9...'...D...
     0030 45060000 00000000

  7. #7
    Registered User
    Join Date
    Feb 2019
    Posts
    1,078
    There are routines in wchar.h to convert between wchar_t and the locale's charset, but you have to set the locale properly first (setlocale is declared in locale.h), typically:
    Code:
      setlocale( LC_ALL, "" );
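    A minimal sketch of such a conversion, using the standard mbstowcs from stdlib.h. The wrapper names use_environment_locale and to_wide are hypothetical, and the result for non-ASCII input depends on the locale the program runs under:

```c
#include <locale.h>
#include <stdlib.h>
#include <wchar.h>

/* Adopt the locale from the environment; returns nonzero on success. */
int use_environment_locale(void)
{
    return setlocale(LC_ALL, "") != NULL;
}

/* Convert a multibyte string (in the current locale's encoding) to wide
   characters. Writes at most out_len - 1 characters plus a terminator.
   Returns the number of wide characters written, or (size_t)-1 if the
   input is invalid in the current locale or out_len is 0. */
size_t to_wide(const char *mbs, wchar_t *out, size_t out_len)
{
    size_t n;

    if (out_len == 0)
        return (size_t)-1;
    n = mbstowcs(out, mbs, out_len - 1);
    if (n == (size_t)-1)
        return n;
    out[n] = L'\0';   /* mbstowcs does not terminate when the buffer fills */
    return n;
}
```

    Call use_environment_locale() once at startup, before the first conversion. Without it the program stays in the "C" locale, where only the basic character set is guaranteed to convert.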

  8. #8
    Registered User
    Join Date
    Feb 2022
    Location
    Canada, PEI
    Posts
    103
    A little off topic... I remember back in the days when you had to have an ASCII text editor to write code in C. I guess with the proliferation of multi-byte characters and Unicode character sets, that has been upped to a Unicode editor.

  9. #9
    Registered User
    Join Date
    Feb 2019
    Posts
    1,078
    Quote Originally Posted by G4143 View Post
    A little off topic... I remember back in the days when you had to have an ASCII text editor to write code in C. I guess with the proliferation of multi-byte characters and Unicode character sets, that has been upped to a Unicode editor.
    Yep, but it is nice to know that multi-byte charsets can be traced back to Dennis Ritchie and Ken Thompson in the '70s (Unicode was designed based on experience gained at Xerox in the '80s; it arrived as a standard in the '90s). And since C99, Unicode (and multi-byte charsets) have been supported by the ISO 9899 standard (see 5.2.1.2 and the normative Annex D).
    Last edited by flp1969; 03-25-2022 at 10:46 AM.
