ASCII and EBCDIC

**G4143** · 03-24-2022

I'd love to write a program that takes certain characters (a-z, A-Z, 0-9) and maps them to int values, but I'm unsure how to make that portable and correct.

The problem comes from comparing the ASCII and EBCDIC character sets. ASCII places these characters (a-z, A-Z, 0-9) in the first 128 positions, while EBCDIC places them in positions 128-255.

Now here's my problem: what type should I use for my character to make the program portable? Will the char type do?

What is char , signed char , unsigned char , and character literals in C? | by mohamad wael | Analytics Vidhya | Medium

The char type can be signed or it can be unsigned, this is implementation defined. The C standard defines the minimum range that the char type can have, an implementation can define large ranges.
If the char type is unsigned, then it can only contain non negative values, and its minimum range as defined by the C standard is between 0 and 127. If the char type is signed, then it can contain 0, negative, and positive values, and its minimum range as defined by the C standard is between -127 and 127.
Beside the char type in C, there is also the unsigned char, and the signed char types. All three types are different, but they have the same size of 1 byte. The unsigned char type can only store nonnegative integer values, it has a minimum range between 0 and 127, as defined by the C standard. The signed char type can store negative, zero, and positive integer values. It has a minimum range between -127 and 127, as defined by the C standard.

Reading the above link, I would assume the implementation takes care of the size details, so I could use the code below and it would work (at least when compiled for the current implementation):

Code:

#include <stdio.h>
#include <stdlib.h>

int get_index(char c) {
  switch (c) {
  case 'a':
  case 'A':
    return 0;
  case 'b':
  case 'B':
    return 1;
    /*and continues for the remaining characters*/
  default:
    return -1;
  }
}

int main(int argc, char ** argv) {
  fprintf(stdout, "%d\n", get_index('A'));
  fprintf(stdout, "%d\n", get_index('b'));
  return EXIT_SUCCESS;
}
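For anyone who wants to check what their own implementation does with char, here is a minimal sketch using the macros from <limits.h>. The function names char_is_signed and print_char_ranges are made up for illustration. Note that the C standard requires UCHAR_MAX to be at least 255, which is wider than the 0 to 127 figure in the quoted article; 0 to 127 is only the range that char is guaranteed to cover regardless of its signedness.

```c
#include <limits.h>
#include <stdio.h>

/* Returns 1 if plain char is a signed type on this implementation, else 0. */
int char_is_signed(void) {
    return CHAR_MIN < 0;
}

/* Prints the ranges this particular implementation actually provides. */
void print_char_ranges(void) {
    printf("char:          %ld .. %ld (%s)\n", (long)CHAR_MIN, (long)CHAR_MAX,
           char_is_signed() ? "signed" : "unsigned");
    printf("signed char:   %d .. %d\n", SCHAR_MIN, SCHAR_MAX);
    printf("unsigned char: 0 .. %lu\n", (unsigned long)UCHAR_MAX);
    printf("bits per char: %d\n", CHAR_BIT);
}
```

Calling print_char_ranges() from main shows the values for the compiler and target at hand; they can differ on another platform.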

**laserlight** · 03-24-2022

Your strategy of using char and character constants in a switch should be absolutely portable for any standard-conforming C implementation, since the C standard explicitly mandates that the characters you're interested in are in the basic source and basic execution character sets.
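One related guarantee may save some typing for the 0-9 part: the C standard requires the decimal digits '0' through '9' to have consecutive values in both basic character sets, so subtraction is portable for digits. No such guarantee exists for letters. A minimal sketch (the name digit_value is just for illustration):

```c
/* Maps '0'..'9' to 0..9 and returns -1 for anything else.
   Relies only on the standard's guarantee that the ten decimal
   digits are contiguous; this holds for ASCII and EBCDIC alike. */
int digit_value(int ch) {
    if (ch >= '0' && ch <= '9')
        return ch - '0';
    return -1;
}
```

The same trick with 'a' and 'z' would happen to work on ASCII but is not guaranteed by the standard, which is why the switch approach is the safe choice for letters.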

That said, is supporting EBCDIC just an exercise, or is it of some practical concern? There may be a few legacy systems that use it today, but writing new software that explicitly has it in mind seems a little strange in 2022.

**john.c** · 03-24-2022

EBCDIC is apparently still used on IBM mainframes, although ASCII is "tolerated" in that you can edit ASCII files and the CPU has instructions for conversion. EBCDIC - Wikipedia

I would use int since character literals are ints in C anyway, and functions like getchar return an int.

Code:

#include <stdio.h>
#include <ctype.h>
 
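/* The alphabet is handled as three runs (A-I, J-R, S-Z) because EBCDIC
   has gaps between them; each run is consecutive in both EBCDIC and ASCII. */
int get_index(int ch)
{
    ch = toupper(ch);
    if (ch >= 'A' && ch <= 'I') return ch - 'A';
    if (ch >= 'J' && ch <= 'R') return ch - 'J' + 9;
    if (ch >= 'S' && ch <= 'Z') return ch - 'S' + 18;
    return -1;
}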
 
int main(void)
{
    int ch;
    while ((ch = getchar()) != EOF)
        printf("%d\n", get_index(ch));
    return 0;
}

A switch might actually be faster.

Code:

int get_index(int ch)
{
    switch (toupper(ch))
    {
    case 'A': return 0;
    case 'B': return 1;
    /* ...and so on for the remaining letters */
    }
    return -1;
}

**G4143** · 03-24-2022

I just used EBCDIC as an example of characters which differed from ASCII to help with my example. I'm trying to get into a C frame of mind when I sit down and code, basically I'm trying to understand the problem in a scope larger than the computer I'm sitting in front of.

**Salem** · 03-25-2022

Code:

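#include <string.h>

int get_index(char c) {
    const char *table = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    /* strchr also matches the terminating '\0', so reject it up front */
    const char *p = (c != '\0') ? strchr(table, c) : NULL;
    if ( p ) {
        return (int)(p - table);
    } else {
        return -1;
    }
}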

Or if you're doing this a lot, perhaps a lookup table.

Code:

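#include <limits.h>

int get_index(char c) {
    static int init = 0;
    static int lut[UCHAR_MAX + 1];
    if ( !init ) {
        const char *table = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
        for(int i = 0 ; i <= UCHAR_MAX; i++ ) lut[i] = -1;
        for(int i = 0 ; table[i] ; i++ ) {
            /* index as unsigned char so the subscript is never negative */
            lut[(unsigned char)table[i]] = i;
        }
        init = 1;
    }
    return lut[(unsigned char)c];
}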
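The same table idea stretches to the full set from the original question (a-z, A-Z, 0-9). The sketch below is only one possible layout: the numbering (digits first as 0..9, then letters as 10..35, case-insensitive) and the name alnum_index are assumptions, since the thread never fixes a particular mapping.

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical numbering: '0'..'9' -> 0..9, 'A'/'a'..'Z'/'z' -> 10..35.
   Returns -1 for any other character. Because the position comes from
   the table string, nothing here depends on the numeric character codes. */
int alnum_index(char c) {
    static const char table[] = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    /* toupper needs a value representable as unsigned char */
    int up = toupper((unsigned char)c);
    const char *p;

    if (up == '\0')          /* strchr would match the terminator */
        return -1;
    p = strchr(table, up);
    return p ? (int)(p - table) : -1;
}
```

Reordering the table string changes the mapping without touching the logic.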

**flp1969** · 03-25-2022

Well... working with single-byte charsets isn't a real portability problem. Nowadays we have a harder one: on Windows you can work with the WINDOWS-1252 charset (a modified ISO-8859-1 charset), while on Linux and other UNIXes it is common to use UTF-8 (a multi-byte charset).

Take the character † as an example. In WINDOWS-1252 it is 0x86; it doesn't exist in ISO-8859-1; and in UTF-8 it is U+2020, encoded as 0xe2 0x80 0xa0 (3 bytes long).
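To see where those three bytes come from, here is a small sketch of the standard UTF-8 encoding rule written by hand. The function name utf8_encode is made up for this example; real programs would normally use a library rather than this.

```c
#include <stddef.h>

/* Encodes one Unicode code point as UTF-8 into out (room for 4 bytes).
   Returns the number of bytes written, or 0 if cp cannot be encoded
   (a UTF-16 surrogate or a value above U+10FFFF). */
size_t utf8_encode(unsigned long cp, unsigned char out[4]) {
    if (cp < 0x80) {                       /* 1 byte: 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                    /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                      /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* 4 bytes: 11110xxx then 3 x 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Passing 0x2020 yields the three bytes e2 80 a0 mentioned above.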

One solution is to convert charsets using a library like libiconv and ask the library to ignore or transliterate characters from one charset that don't exist in the other. Another solution is to use wchar_t, instead of char, as the unit for characters, along with the functions declared in wchar.h. On Linux, wchar_t is typically 32 bits long and can hold any UNICODE code point (on Windows it is 16 bits).

Notice that C is a multi-byte-charset-capable language (as described in ISO/IEC 9899); you can, with no problems, declare something like this:

Code:

char s[] = "مرحبا بالعالم"; // "Hello, world" in Arabic

But this is stored as UTF-8 (assuming the source file and the compiler's execution charset are UTF-8)... You can also do:

Code:

wchar_t s[] = L"مرحبا بالعالم"; // "Hello, world", using wchar_t

First case (s with char):

Code:

$ objdump -s test.o
test.o:     file format elf64-x86-64

Contents of section .data:
 0000 d985d8b1 d8add8a8 d8a720d8 a8d8a7d9  .......... .....
 0010 84d8b9d8 a7d984d9 8500               ..........

Second, with wchar_t:

Code:

$ objdump -s test.o
test.o:     file format elf64-x86-64

Contents of section .data:
 0000 45060000 31060000 2d060000 28060000  E...1...-...(...
 0010 27060000 20000000 28060000 27060000  '... ...(...'...
 0020 44060000 39060000 27060000 44060000  D...9...'...D...
 0030 45060000 00000000

**flp1969** · 03-25-2022

There are routines in wchar.h to convert between wchar_t and the local charset, but you have to set the locale properly, typically:

Code:

  setlocale( LC_ALL, "" );  /* declared in <locale.h> */
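As a rough sketch of how such a conversion might look, the helper below wraps the standard mbstowcs function (declared in <stdlib.h>). The name to_wide is invented for this example, and what counts as valid non-ASCII input depends entirely on the locale in effect at run time.

```c
#include <locale.h>
#include <stdlib.h>
#include <wchar.h>

/* Converts a multibyte string (in the current locale's encoding) to wide
   characters. dst must have room for dst_len elements. Returns the number
   of wide characters stored, not counting the terminator, or -1 if the
   input is invalid in the current locale or dst_len is 0. */
long to_wide(const char *src, wchar_t *dst, size_t dst_len) {
    size_t n;

    if (dst_len == 0)
        return -1;
    /* leave one slot free so the result can always be terminated */
    n = mbstowcs(dst, src, dst_len - 1);
    if (n == (size_t)-1)
        return -1;
    dst[n] = L'\0';
    return (long)n;
}
```

Call setlocale(LC_ALL, "") once at startup before using it; without that the program stays in the "C" locale, where only the basic characters are certain to convert.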

**G4143** · 03-25-2022

A little off topic: I remember back in the day when you had to have an ASCII text editor to write code in C. I guess with the proliferation of multi-byte characters and Unicode character sets, that has been upped to a Unicode editor.

**flp1969** · 03-25-2022

Originally Posted by G4143

A little off topic: I remember back in the day when you had to have an ASCII text editor to write code in C. I guess with the proliferation of multi-byte characters and Unicode character sets, that has been upped to a Unicode editor.

Yep, but it is nice to know that multi-byte charsets can be traced back to Dennis Ritchie and Ken Thompson in the 70's (UNICODE was designed based on experience gained at XEROX in the 80's; it arrived as a standard in the 90's). And since C99, UNICODE (and multi-byte charsets) are supported by the ISO 9899 standard (see 5.2.1.2 and the normative Annex D).
