loading array w/Chinese chars

**Swaine777** · 07-19-2003

Wow, what's the record for the most posts on the same part of the same program? Here I am again...making progress, but now I can't figure out why Chinese Win98 will display the characters (in DOS) in one program but not in the other. I'm still toying with the idea of wchar_t, but I guess I'd have to install XP for that, as Unicode only works with the NT kernel.

OK, here's the program that displays characters correctly in Chinese Win98/DOS using getline():

Code:

#include <iostream>
#include <cstdlib>
#include <fstream>

using namespace std;

int main()
{ char arry[12][120]; 
  int count = -1;
  
  cout << "Loading radicals3.txt into the array...\n" << endl;
  
  ifstream infile("radicals3.txt");
  
  for(int a=0; a<12; a++)
  { infile.getline(arry[++count], 120, '\n');
  }  
  
  for(int b=0; b<7; b++)
  { cout << "\narry [" << b << "] [" << arry[b] 
         << "]" << endl;
  }  
    
  system("pause");  
    
  for(int b=7; b<12; b++)
  { cout << "\narry [" << b << "] [" << arry[b] 
         << "]" << endl;
  }    
    
    
  system("pause");
  return 0;
}

Now here's the code (using get()) for the section of my program that will load in letters of the alphabet and display them perfectly but won't load(?) or display when I try it with Chinese characters:

Code:

void Hanzi::DisplayRads(int r)
{ 
  ifstream infile;
  ostringstream name;
      //int i = r;
  name << r << "str.txt" << flush;
  cout << "\nname is " << name.str() << "\n" << endl;
  string name2 = name.str();
  infile.open(name2.c_str());
  
  if(!infile)
    cout << "Aw, bummer, the file didn't open!\n";
  else
  { char arry[120]; 
    //infile.getline(arry[0], 60, '\n');
    char ch;
    //long index = 0;
    
    //infile >> ch;
    
    for(int c=0; c<120; c++)   //this lets it feed in one char @ a time
      infile.get(arry[c]);
    
    //while(!infile.eof())
    //{
    //    arry[index++] = ch;
    //}
    
    for(int b=0; b<20; b++)
    { if((b==0) || (b%2 == 0))
        cout << b+1 << ". " << arry[b];
      else
        cout << "\t" << b+1 << ". " << arry[b] << endl;
    }        
  }        //if(infile) else
  infile.close();
}                      //Hanzi::DisplayRads

Any ideas? Thanks guys. I love this board!

Swaine777

**Cat** · 07-21-2003

You can't use single character input because your characters may occupy multiple bytes.

You're dealing with text that is encoded. There are many ways for the computer to deal with it; yours uses some form of MBCS (multibyte character set). What YOU see as a single character is actually somewhere between one and four bytes long.

The problem with reading a single char (a la get()) is that one char variable may only be part of one character on the screen. You're reading in only half of the character (or one third or one fourth).

There's no really easy thing to do; reading in whole strings is simple and seems to work, but reading one character out of a string is not as easy. You'd need to learn more about the encoding used by your machine.
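
One way to learn more about the encoding is to look at the raw bytes of a line that was read whole. The following is only a sketch: `hexDump` is a hypothetical helper name, not something from the posts above, and it assumes the text was read into a `std::string` (for example with `std::getline`).

```cpp
#include <string>

// Return the bytes of `text` as space-separated uppercase hex pairs.
// Hypothetical helper for inspecting what a file really contains.
std::string hexDump(const std::string& text) {
    const char digits[] = "0123456789ABCDEF";
    std::string out;
    for (std::string::size_type i = 0; i < text.size(); ++i) {
        // Go through unsigned char so bytes >= 0x80 are not sign-extended.
        unsigned char byte = static_cast<unsigned char>(text[i]);
        if (i != 0)
            out += ' ';
        out += digits[byte >> 4];
        out += digits[byte & 0x0F];
    }
    return out;
}
```

If the dump shows bytes of 0x80 and above arriving in pairs between plain ASCII values, the file is probably in a double-byte encoding, and one `char` holds only half of each such character.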

**Swaine777** · 07-22-2003

Thanks a lot, Cat. I know that programming with Chinese characters is best done with the NT kernel and Unicode, what with the big, fat characters and all. I guess I'm going to have to reload XP on my system and try to learn to do some Windows/Unicode programming. I have a tutorial or 2...just have to keep my nose to the grindstone and learn it. I love Win98 but it's just not enough at this point, I guess.

Can you suggest any really good Windows/Unicode C++ tutorials, books, etc.?

Thanks again,

Swaine777

**Lynux-Penguin** · 07-22-2003

Actually, for the Chinese character set, I am pretty sure it uses only two bytes. I know the Japanese character set does.

In order to get input from a text file, you need to include some type of Unicode support and use the functions that come with it.

Unicode character sets (for Japanese anyway) come in pairs: first byte, then second byte. I am pretty sure Chinese is the same way.

http://www.google.com/search?q=unicode+support+in+c

**Cat** · 07-22-2003

Actually, that's not correct about Japanese in Unicode. Firstly, there is more than one encoding for Unicode. Under UTF-16, which is what WinNT uses internally, each character in the Japanese character set can indeed fit into one code unit (16 bits). Under UTF-8, which is widely used to send information over networks, a single Japanese character will occupy anywhere between one code unit (8 bits) and four code units (32 bits).

Actually, that's not technically true either: one code point will be between 1 and 4 bytes under UTF-8, but there's not exactly a 1:1 correspondence between code points and the glyphs you see on screen.
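
To make the byte counts concrete, here is a minimal sketch of the standard UTF-8 bit layout. The function names `toUtf8` and `utf16Units` are made up for illustration, and the code assumes a valid code point (no check for surrogates or values above U+10FFFF).

```cpp
#include <string>

// Encode one Unicode code point as UTF-8 bytes.
// Assumes cp is a valid scalar value; no validation is done.
std::string toUtf8(unsigned long cp) {
    std::string out;
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: 11110xxx + three 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Number of 16-bit code units UTF-16 needs for the same code point.
int utf16Units(unsigned long cp) {
    return cp < 0x10000 ? 1 : 2;
}
```

For example, U+77E5 (a kanji) comes out as the three bytes E7 9F A5 in UTF-8 but needs only one 16-bit unit in UTF-16, while plain ASCII stays at one byte in UTF-8.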

Under Chinese Simplified (GB2312) characters occupy either one or two bytes. I would wager this is the character set you're using, Swaine. Some info on it is here:

http://www.domainisland.com/di/html/charsets_gb2312.htm

It doesn't seem that hard to handle those characters. Perhaps a class like this? I can't test as I can't run under that code page.

Code:

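#include <iostream>

// One GB2312 character: a single byte, or a lead byte above 0x80 plus a second byte.
class Char2312{
private:
	unsigned char buffer[2];
	friend std::istream & operator>>(std::istream&, Char2312&);
	friend std::ostream & operator<<(std::ostream&, const Char2312&);
};

std::istream & operator>>(std::istream& i, Char2312& c){
	int first = i.get();	// get() returns int so end-of-file can be detected
	if (!i)
		return i;
	c.buffer[0] = static_cast<unsigned char>(first);
	c.buffer[1] = 0;
	if (c.buffer[0] > 0x80) {
		int second = i.get();
		if (i)
			c.buffer[1] = static_cast<unsigned char>(second);
	}
	return i;
}

std::ostream & operator<<(std::ostream& o, const Char2312 &c){
	o.put(c.buffer[0]);
	if (c.buffer[0] > 0x80)
		o.put(c.buffer[1]);
	return o;
}

int main(){
	Char2312 c;
	std::cout << "Enter a character and press enter: ";
	if (std::cin >> c)	// only print if a character was actually read
		std::cout << "You entered " << c << std::endl;
}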

It's also possible that you're using Big5 (http://www.domainisland.com/di/html/charsets_big5.htm) but the code to read a single character seems to be the same.

Edit: Bugfix.

BTW, I just tested a version of that code (modified to account for a different encoding scheme) in a console program using japanese Shift-JIS encoding (which I have enabled on my system) and it worked just fine.

The class can certainly be expanded, and another useful class would be some form of string class so that you could read and write more than one character at a time.

Something as simple as this could be used to incorporate your characters into std::string objects (using the + or += operators):

Code:

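// Needs <string> and a matching friend declaration inside Char2312, since buffer is private.
std::string & operator+=(std::string &s, const Char2312& c){
	s += static_cast<char>(c.buffer[0]);
	if (c.buffer[0] > 0x80)
		s += static_cast<char>(c.buffer[1]);
	return s;
}


// Returns by value: s is a local copy, so returning a reference to it would dangle.
std::string operator+(std::string s, const Char2312& c){
	return s += c;
}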
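
Going the other way, a string that was read whole can be cut back into individual characters with the same lead-byte rule. This is only a sketch under the thread's assumption that any byte above 0x80 starts a two-byte character (GB2312 or Big5 style); `splitDbcs` is a hypothetical name, and the rule would be wrong for other encodings.

```cpp
#include <string>
#include <vector>

// Split a GB2312/Big5-style byte string into one std::string per character.
// Assumption: a byte above 0x80 is a lead byte followed by one more byte.
std::vector<std::string> splitDbcs(const std::string& bytes) {
    std::vector<std::string> chars;
    std::string::size_type i = 0;
    while (i < bytes.size()) {
        unsigned char lead = static_cast<unsigned char>(bytes[i]);
        // A lead byte at the very end has no partner; keep it as one byte.
        std::string::size_type len =
            (lead > 0x80 && i + 1 < bytes.size()) ? 2 : 1;
        chars.push_back(bytes.substr(i, len));
        i += len;
    }
    return chars;
}
```

Each element of the result can be sent to `cout` as a unit, so a numbered listing can index by character rather than by byte.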

**Lynux-Penguin** · 07-22-2003

hmm, i never knew that...
I just remember working with a windows font called mtgothic or something. And Unicode with it.

Thanks for the info.

I have this program called NJStar Japanese that uses all sorts of encoding and from what I can see, all of the encodings use double byte...

only the ASCII is single byte.

I also have a weird version of the same program for Chinese but I can't tell the encoding or anything...

だから知らないの

That is double-byte for EUC and Shift-JIS according to the encoding:
Example:
Character: 知
EUC - C3CE
Shift-JIS - 926D
Kuten - 3546
Unicode - 77E5
and all those listed are double byte char sets... (I am pretty sure but not completely)
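
The EUC and Shift-JIS values in that list can both be derived from the Kuten number, which is a decimal row (35) and cell (46) in JIS X 0208. The sketch below uses the usual conversion arithmetic; the struct and function names are invented for illustration, and it assumes row and cell are each in the range 1 to 94.

```cpp
// A two-byte encoded character.
struct TwoBytes {
    unsigned char first;
    unsigned char second;
};

// EUC-JP: add 0xA0 to both the row (ku) and the cell (ten).
TwoBytes kutenToEuc(int ku, int ten) {
    TwoBytes r = { static_cast<unsigned char>(ku + 0xA0),
                   static_cast<unsigned char>(ten + 0xA0) };
    return r;
}

// Shift-JIS: pack two rows into each lead byte.
TwoBytes kutenToSjis(int ku, int ten) {
    int j1 = ku + 0x20;                  // 7-bit JIS first byte
    int j2 = ten + 0x20;                 // 7-bit JIS second byte
    int s1 = ((j1 - 0x21) >> 1) + 0x81;
    if (s1 > 0x9F)
        s1 += 0x40;                      // jump over 0xA0-0xDF (halfwidth kana)
    int s2;
    if (j1 & 1) {                        // odd JIS row: low half of trail range
        s2 = j2 + 0x1F;
        if (s2 >= 0x7F)
            ++s2;                        // 0x7F is never a trail byte
    } else {                             // even JIS row: high half
        s2 = j2 + 0x7E;
    }
    TwoBytes r = { static_cast<unsigned char>(s1),
                   static_cast<unsigned char>(s2) };
    return r;
}
```

With row 35 and cell 46 this gives C3 CE for EUC and 92 6D for Shift-JIS, matching the values listed above.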

I gave up working with double-char sets on windows... It became too troublesome but if you find an easy way I would be glad to learn of it ^_^

Thanks for the info, I will be sure to read it.

-LC

**Cat** · 07-22-2003

Unicode using UTF-16 is very easy to handle on Windows NT, as there is native support for it and almost every character occupies only one 16-bit code unit.

And not every character in Shift-JIS encoded Japanese text is two bytes. Apart from the ASCII characters, there are halfwidth kana characters like ﾙ (0xD9) that are single byte. Under Big5 or GB2312 the ASCII characters are one byte and the Chinese characters are two, but a good system should allow the two to be mixed; you shouldn't assume the next character will occupy two bytes, or assume it will occupy one.

In either case, using either BIG5 or GB2312, if the first byte is larger than 0x80, there is a second byte to the character. It is relatively simple to read and write one character at a time, if a little tedious, using the method I showed in my last post or an equivalent method.
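
For Shift-JIS the lead-byte test has to be narrower, because the halfwidth kana sit at 0xA1 to 0xDF as single bytes. A rough sketch, using the commonly documented Shift-JIS lead-byte ranges (0x81 to 0x9F and 0xE0 to 0xFC); the function names are hypothetical:

```cpp
#include <string>

// Bytes occupied by the Shift-JIS character that starts with `lead`.
// Lead-byte ranges assumed: 0x81-0x9F and 0xE0-0xFC.
int sjisCharLength(unsigned char lead) {
    bool isLead = (lead >= 0x81 && lead <= 0x9F) ||
                  (lead >= 0xE0 && lead <= 0xFC);
    return isLead ? 2 : 1;   // ASCII and halfwidth kana are single bytes
}

// Count characters (not bytes) in a Shift-JIS encoded string.
int countSjisChars(const std::string& bytes) {
    int count = 0;
    std::string::size_type i = 0;
    while (i < bytes.size()) {
        i += sjisCharLength(static_cast<unsigned char>(bytes[i]));
        ++count;
    }
    return count;
}
```

A plain "greater than 0x80" test would misread the halfwidth kana byte 0xD9 as the start of a two-byte character, which is why the check has to be adapted per encoding.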

**Lynux-Penguin** · 07-24-2003

hey thanks for the info.

I didn't know this.

-LC
