Thread: loading array w/Chinese chars

  1. #1
    Registered User
    Join Date
    Mar 2003
    Posts
    30

    Question loading array w/Chinese chars

    Wow, what's the record for the biggest amount of posts on the same part of the same program? Here I am again...making progress, but now I can't figure out why Chinese Win98 will display the characters (in DOS) in the one program but not in the other. I'm still toying with the idea of wchar_t but I guess I'd have to install XP for that one as Unicode only works with the NT kernel.

    OK, here's the program that displays characters correctly in Chinese Win98/DOS using getline():

    Code:
    #include <iostream>
    #include <cstdlib>
    #include <fstream>
    
    using namespace std;
    
    int main()
    { char arry[12][120]; 
      int count = -1;
      
      cout << "Loading radicals3.txt into the array...\n" << endl;
      
      ifstream infile("radicals3.txt");
      
      for(int a=0; a<12; a++)
      { infile.getline(arry[++count], 120, '\n');
      }  
      
      for(int b=0; b<7; b++)
      { cout << "\narry [" << b << "] [" << arry[b] 
             << "]" << endl;
      }  
        
      system("pause");  
        
      for(int b=7; b<12; b++)
      { cout << "\narry [" << b << "] [" << arry[b] 
             << "]" << endl;
      }    
        
        
      system("pause");
      return 0;
    }
    Now here's the code (using get()) for the section of my program that will load in letters of the alphabet and display them perfectly but won't load(?) or display when I try it with Chinese characters:

    Code:
    void Hanzi::DisplayRads(int r)
    { 
      ifstream infile;
      ostringstream name;
          //int i = r;
      name << r << "str.txt" << flush;
      cout << "\nname is " << name.str() << "\n" << endl;
      string name2 = name.str();
      infile.open(name2.c_str());
      
      if(!infile)
        cout << "Aw, bummer, the file didn't open!\n";
      else
      { char arry[120]; 
        //infile.getline(arry[0], 60, '\n');
        char ch;
        //long index = 0;
        
        //infile >> ch;
        
        for(int c=0; c<120; c++)   //this lets it feed in one char @ a time
          infile.get(arry[c]);
        
        //while(!infile.eof())
        //{
        //    arry[index++] = ch;
        //}
        
        for(int b=0; b<20; b++)
        { if((b==0) || (b%2 == 0))
            cout << b+1 << ". " << arry[b];
          else
            cout << "\t" << b+1 << ". " << arry[b] << endl;
        }        
      }        //if(infile) else
      infile.close();
    }                      //Hanzi::DisplayRads
    Any ideas? Thanks guys. I love this board!

    Swaine777

  2. #2
    Registered User
    Join Date
    May 2003
    Posts
    1,619
    You can't use single character input because your characters may occupy multiple bytes.

    You're dealing with text that is encoded. There are many ways for the computer to deal with it; yours uses some form of MBCS (multibyte character set). What YOU see as a single character is actually somewhere between one and four bytes long.

    The problem with reading a single char (a la get()) is that one char variable may only be part of one character on the screen. You're reading in only half of the character (or one third or one fourth).

    There's no really easy thing to do; reading in whole strings is simple and seems to work, but reading one character out of a string is not as easy. You'd need to learn more about the encoding used by your machine.
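
    A minimal sketch of the difference, assuming the file is GB2312 and holds the two characters 中文 (bytes D6 D0 CE C4). The helper names toHex, readOneByte and readWholeLine are made up for this illustration. A single get() returns only the first byte of the first character, while getline() keeps all four bytes together:

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: render each byte of s as two uppercase hex digits.
std::string toHex(const std::string& s) {
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    for (std::size_t i = 0; i < s.size(); ++i) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        if (i) out += ' ';
        out += digits[b >> 4];
        out += digits[b & 0x0F];
    }
    return out;
}

// Read a single char with get(), the way the loop in the first post does.
// For a two-byte character this is only the first half.
std::string readOneByte(std::istream& in) {
    char ch;
    std::string out;
    if (in.get(ch)) out += ch;
    return out;
}

// Read a whole line with getline(); every byte of every character is kept
// in order, so the console can still pair them up when the line is printed.
std::string readWholeLine(std::istream& in) {
    std::string line;
    std::getline(in, line);
    return line;
}
```

    Printing the bytes one at a time with other text between them (the "1. ", tab and newline in DisplayRads) splits each pair, which would explain why letters display fine but Chinese characters do not.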

  3. #3
    Registered User
    Join Date
    Mar 2003
    Posts
    30

    thanks

    Thanks a lot, Cat. I know that programming with Chinese characters is best done with the NT kernel and Unicode, what with the big, fat characters and all. I guess I'm going to have to reload XP on my system and try to learn to do some Windows/Unicode programming. I have a tutorial or 2...just have to keep my nose to the grindstone and learn it. I love Win98 but it's just not enough at this point, I guess.

    Can you suggest any really good Windows/Unicode C++ tutorials, books, etc.?

    Thanks again,

    Swaine777

  4. #4
    Comment your source code! Lynux-Penguin's Avatar
    Join Date
    Apr 2002
    Posts
    533
    Actually, I am pretty sure the Chinese character set uses only two bytes per character. I know the Japanese char set does.

    In order to get input from a text file, you need to include some type of Unicode support and use the functions that come with it.

    Unicode sets (for Japanese anyway) come in pairs, first byte then second byte. I am pretty sure Chinese is the same way.

    http://www.google.com/search?q=unicode+support+in+c
    Asking the right question is sometimes more important than knowing the answer.
    Please read the FAQ
    C Reference Card (A MUST!)
    Pointers and Memory
    The Essentials
    CString lib

  5. #5
    Registered User
    Join Date
    May 2003
    Posts
    1,619
    Actually, that's not correct about Japanese in Unicode. Firstly, there is more than one encoding for Unicode. Under UTF-16, which is what WinNT uses internally, nearly every Japanese character in common use fits into one 16-bit code unit. Under UTF-8, which is widely used to send information over networks, a single character takes anywhere between one and four 8-bit code units (bytes); most Japanese characters take three.

    Actually, even that's not the whole story: one CODE POINT will be between 1 and 4 bytes under UTF-8, but there's not exactly a 1:1 correspondence between code points and the characters you see on screen.
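
    To make the byte counts concrete, here is a small sketch of how many units one Unicode code point needs in each encoding. The function names are invented for this example, and surrogate values (0xD800 to 0xDFFF) are not rejected, so treat it as an illustration rather than a validating encoder:

```cpp
#include <string>

// Number of bytes UTF-8 needs for one code point (0 if out of range).
int utf8Length(unsigned long cp) {
    if (cp < 0x80) return 1;
    if (cp < 0x800) return 2;
    if (cp < 0x10000) return 3;
    if (cp <= 0x10FFFF) return 4;
    return 0;
}

// Number of 16-bit code units UTF-16 needs for one code point.
// Code points above 0xFFFF need a surrogate pair (two units).
int utf16Units(unsigned long cp) {
    if (cp < 0x10000) return 1;
    if (cp <= 0x10FFFF) return 2;
    return 0;
}

// Encode one code point as UTF-8 bytes (empty string if out of range).
std::string encodeUtf8(unsigned long cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

    For example, the kanji 知 is code point 0x77E5: utf16Units(0x77E5) gives 1, utf8Length(0x77E5) gives 3, and encodeUtf8(0x77E5) produces the bytes E7 9F A5.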

    Under Chinese Simplified (GB2312) characters occupy either one or two bytes. I would wager this is the character set you're using, Swaine. Some info on it is here:

    http://www.domainisland.com/di/html/charsets_gb2312.htm

    It doesn't seem that hard to handle those characters. Perhaps a class like this? I can't test as I can't run under that code page.

    Code:
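    #include <iostream>
    
    // One GB2312 character: either a single byte, or a lead byte plus a trail byte.
    class Char2312{
    private:
    	unsigned char buffer[2];
    	friend std::istream & operator>>(std::istream&, Char2312&);
    	friend std::ostream & operator<<(std::ostream&, const Char2312&);
    public:
    	Char2312() { buffer[0] = buffer[1] = 0; }
    };
    
    std::istream & operator>>(std::istream& i, Char2312& c){
    	int first = i.get();
    	if (!i)                      // nothing left to read; leave c unchanged
    		return i;
    	c.buffer[0] = static_cast<unsigned char>(first);
    	c.buffer[1] = 0;
    	if (c.buffer[0] > 0x80)      // lead byte: the trail byte follows
    		c.buffer[1] = static_cast<unsigned char>(i.get());
    	return i;
    }
    
    std::ostream & operator<<(std::ostream& o, const Char2312 &c){
    	o.put(c.buffer[0]);
    	if (c.buffer[0] > 0x80)      // only two-byte characters have a trail byte
    		o.put(c.buffer[1]);
    	return o;
    }
    
    int main(){
    	Char2312 c;
    	std::cout << "Enter a character and press enter: ";
    	std::cin  >> c;
    	std::cout << "You entered " << c << std::endl;
    }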
    It's also possible that you're using Big5 (http://www.domainisland.com/di/html/charsets_big5.htm) but the code to read a single character seems to be the same.

    Edit: Bugfix.

    BTW, I just tested a version of that code (modified to account for a different encoding scheme) in a console program using japanese Shift-JIS encoding (which I have enabled on my system) and it worked just fine.

    The class can certainly be expanded, and another useful class would be some form of string class so that you could read and write more than one character at a time.

    Something as simple as this could be used to incorporate your characters into std::string objects (using the + or += operators):

    Code:
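    // Note: these need <string>, and must be declared as friends of Char2312
    // so they can reach its private buffer.
    std::string & operator+=(std::string &s, const Char2312& c){
    	s += static_cast<char>(c.buffer[0]);
    	if (c.buffer[0] > 0x80)
    		s += static_cast<char>(c.buffer[1]);
    	return s;
    }
    
    
    // Returns by value: s is a local copy, so returning a reference to it would dangle.
    std::string operator+(std::string s, const Char2312& c){
    	s += c;
    	return s;
    }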
    Last edited by Cat; 07-22-2003 at 11:06 AM.
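
    Going the other direction, a line read with getline() can be cut into whole characters with the same test the class above uses (a byte greater than 0x80 starts a two-byte character). This is only a sketch under that assumption; splitDbcs is a hypothetical name, and it does not check that the trail byte is valid:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Split a GB2312/Big5 byte string into characters. Each element of the
// result holds one character: one byte for ASCII, two bytes otherwise.
std::vector<std::string> splitDbcs(const std::string& text) {
    std::vector<std::string> chars;
    std::size_t i = 0;
    while (i < text.size()) {
        unsigned char lead = static_cast<unsigned char>(text[i]);
        // A lead byte at the very end of the string has no trail byte,
        // so it is kept as a single (broken) byte instead of reading past the end.
        std::size_t len = (lead > 0x80 && i + 1 < text.size()) ? 2 : 1;
        chars.push_back(text.substr(i, len));
        i += len;
    }
    return chars;
}
```

    With the result in a vector, a numbered two-column listing like the one in DisplayRads can print element b instead of byte b, so the two halves of a character are never separated.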

  6. #6
    Comment your source code! Lynux-Penguin's Avatar
    Join Date
    Apr 2002
    Posts
    533
    hmm, i never knew that...
    I just remember working with a windows font called mtgothic or something. And Unicode with it.

    Thanks for the info.

    I have this program called NJStar Japanese that uses all sorts of encoding and from what I can see, all of the encodings use double byte...

    only the ASCII is single byte.

    I also have a weird version of the same program for Chinese but I can't tell the encoding or anything...

    だから知らないの

    That is double-byte for EUC and Shift-JIS according to the encoding:
    Example:
    Character: 知
    EUC - C3CE
    Shift-JIS - 926D
    Kuten - 3546
    Unicode - 77E5
    and all those listed are double-byte char sets... (I am pretty sure but not completely)
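
    The EUC and Shift-JIS values in that list can both be derived from the Kuten number (row 35, cell 46) with a little arithmetic. A sketch, assuming a JIS X 0208 character with row and cell in the range 1 to 94; the function names are made up for this example:

```cpp
// EUC-JP: add 0xA0 to the row and to the cell.
unsigned eucFromKuten(unsigned row, unsigned cell) {
    return ((row + 0xA0) << 8) | (cell + 0xA0);
}

// Shift-JIS: two rows share one lead byte, and the trail byte range
// depends on whether the row is odd or even.
unsigned sjisFromKuten(unsigned row, unsigned cell) {
    // Rows 1..62 map to lead bytes 0x81..0x9F, rows 63..94 to 0xE0..0xEF.
    unsigned lead = (row + 1) / 2 + (row <= 62 ? 0x80 : 0xC0);
    unsigned trail;
    if (row % 2 == 1) {
        trail = cell + 0x3F;        // odd row: trail bytes start at 0x40
        if (trail >= 0x7F) ++trail; // 0x7F is skipped
    } else {
        trail = cell + 0x9E;        // even row: trail bytes start at 0x9F
    }
    return (lead << 8) | trail;
}
```

    For 知 this gives eucFromKuten(35, 46) == 0xC3CE and sjisFromKuten(35, 46) == 0x926D, matching the values listed. The Unicode value (77E5) cannot be computed this way; it needs a lookup table.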

    I gave up working with double-byte char sets on Windows... It became too troublesome, but if you find an easy way I would be glad to learn of it ^_^

    Thanks for the info, I will be sure to read it.

    -LC

  7. #7
    Registered User
    Join Date
    May 2003
    Posts
    1,619
    Unicode using UTF-16 is very easy to handle using Windows NT, as there is native support for it and almost every character occupies only one 16-bit code unit.

    Also, not every character in Shift-JIS text is two bytes. Apart from the ASCII characters, there are halfwidth kana characters like ﾙ (0xD9) that are single byte. Under BIG5 or GB2312 the ASCII characters are one byte and the Chinese symbols are two, but a good system should allow the two to be mixed; you shouldn't assume the next character will occupy two bytes, or assume it will occupy one.

    In either case, using either BIG5 or GB2312, if the first byte is larger than 0x80, there is a second byte to the character. It is relatively simple to read and write one character at a time, if a little tedious, using the method I showed in my last post or an equivalent method.
    Last edited by Cat; 07-23-2003 at 10:54 AM.
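
    For Shift-JIS the "larger than 0x80" test is not enough, because the halfwidth kana sit at 0xA1 to 0xDF and are single bytes. A sketch of the adjusted test, assuming the usual Shift-JIS lead byte ranges 0x81 to 0x9F and 0xE0 to 0xFC; isSjisLeadByte and countSjisChars are hypothetical names:

```cpp
#include <cstddef>
#include <string>

// True if b is the first byte of a two-byte Shift-JIS character.
// ASCII (below 0x80) and halfwidth kana (0xA1..0xDF) are single bytes.
bool isSjisLeadByte(unsigned char b) {
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

// Count the characters (not bytes) in a Shift-JIS byte string that may
// mix ASCII, halfwidth kana and two-byte characters.
std::size_t countSjisChars(const std::string& text) {
    std::size_t count = 0;
    std::size_t i = 0;
    while (i < text.size()) {
        unsigned char b = static_cast<unsigned char>(text[i]);
        // Step over two bytes only when a trail byte is actually present.
        i += (isSjisLeadByte(b) && i + 1 < text.size()) ? 2 : 1;
        ++count;
    }
    return count;
}
```

    The four bytes 41 D9 92 6D (the letter A, the halfwidth kana ﾙ, then 知) count as three characters here. The plain "greater than 0x80" rule would wrongly treat D9 as a lead byte and swallow the 92 that follows it.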

  8. #8
    Comment your source code! Lynux-Penguin's Avatar
    Join Date
    Apr 2002
    Posts
    533
    hey thanks for the info.

    I didn't know this.

    -LC
