How to get the words in a non-English text file

  1. #1
    Registered User
    Join Date
    Dec 2008
    Posts
    5

    How to get the words in a non-English text file

    Hi, everyone.
    I ran into a problem while processing some Chinese text. The file has been preprocessed so that the words are separated from each other, and I want to retrieve each word and work out how often it occurs in the file. Since a Chinese word often consists of several characters, while the char type in C holds only one byte, I wonder how I can handle it.

  2. #2
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    You need code that can handle Unicode. A Unicode code point fits in 32 bits (that is what UTF-32 stores per character), but to save space text is often stored as UTF-8, which encodes each character as one to four 8-bit bytes. The leading bits of the first byte indicate how many bytes the character occupies: 0..127 is a single-byte "ascii" character, and values of 128 and above are either the first byte of a multibyte sequence or one of its continuation bytes.
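
As a rough illustration of the lead-byte rule, here is a minimal sketch that assumes the file is UTF-8. The function name `utf8_seq_len` is made up for this example and does not come from any library.

```c
#include <stddef.h>

/* Hypothetical helper: return the total length in bytes (1 to 4) of the
 * UTF-8 sequence that starts with `lead`, or 0 if `lead` cannot start a
 * sequence (a continuation byte 10xxxxxx, or a value UTF-8 never uses). */
size_t utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80)
        return 1;                       /* 0xxxxxxx: plain ASCII */
    if (lead >= 0xC2 && lead <= 0xDF)
        return 2;                       /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0)
        return 3;                       /* 1110xxxx: most Chinese characters */
    if (lead >= 0xF0 && lead <= 0xF4)
        return 4;                       /* 11110xxx */
    return 0;                           /* 0x80..0xC1 and 0xF5..0xFF */
}
```

For example, the character 中 is stored in UTF-8 as the three bytes E4 B8 AD, so calling the function on 0xE4 yields 3.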

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  3. #3
    Registered User
    Join Date
    Dec 2008
    Posts
    5
    Thank you for your quick reply, but I am not very familiar with the coding part. Could you kindly explain it more specifically? How can I get a word in Unicode?

  4. #4
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    I'm not quite sure what to explain - to read non-ASCII text, you will need to read multiple bytes and then decode them into a Unicode code point, which fits in a 32-bit value. You will need to know how the original content is encoded (e.g. UTF-8, UTF-16 or UTF-32).
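
If the content turns out to be UTF-8, the decoding step could look roughly like the sketch below. This is only one possible way to write it; `utf8_decode` is a hypothetical name, and UTF-16 or UTF-32 input would need different code.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one UTF-8 sequence from s (n bytes available)
 * into *cp. Returns the number of bytes consumed, or 0 if the input is
 * truncated or malformed. */
size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
    /* Smallest code point that legitimately needs a sequence of each length. */
    static const uint32_t min_cp[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    size_t len, i;
    uint32_t c;

    if (n == 0)
        return 0;

    /* The lead byte gives the sequence length and the top payload bits. */
    if (s[0] < 0x80)                { len = 1; c = s[0]; }
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; c = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; c = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; c = s[0] & 0x07; }
    else
        return 0;                   /* stray continuation or invalid byte */

    if (len > n)
        return 0;                   /* sequence is cut off */

    /* Each continuation byte is 10xxxxxx and contributes 6 payload bits. */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;
        c = (c << 6) | (uint32_t)(s[i] & 0x3F);
    }

    /* Reject overlong forms, UTF-16 surrogates and out-of-range values. */
    if (len > 1 && c < min_cp[len])
        return 0;
    if (c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        return 0;

    *cp = c;
    return len;
}
```

Called in a loop, advancing by the return value each time, this walks a buffer one character at a time. The bytes E4 B8 AD, for instance, decode to the code point U+4E2D.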


  5. #5
    Registered User
    Join Date
    Dec 2008
    Posts
    5
    Thanks anyway! I will try it.

  6. #6
    Registered User (Codeplug)
    Join Date
    Mar 2003
    Posts
    4,699
    We can help you more - but we need to know how the file is encoded. Once we know how it is encoded, then we'll know how to look for the word separator. Then you could just use memcmp to determine word equality.
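
To make the memcmp idea concrete, here is a sketch under two assumptions that the thread does not confirm: the file is UTF-8, and the preprocessing separated the words with ASCII whitespace. In UTF-8 every byte of a multibyte character is 128 or above, so scanning for ASCII separators never cuts a character in half. The names `word_count` and `count_words` are invented for this example.

```c
#include <stddef.h>
#include <string.h>

/* One entry per distinct word; `start` points into the caller's buffer. */
struct word_count {
    const unsigned char *start;
    size_t len;
    size_t count;
};

/* Assumed separators: space, tab, newline, carriage return. */
static int is_sep(unsigned char b)
{
    return b == ' ' || b == '\t' || b == '\n' || b == '\r';
}

/* Count the words in buf[0..n) into table (capacity cap).
 * Returns the number of distinct words, or (size_t)-1 if table is full. */
size_t count_words(const unsigned char *buf, size_t n,
                   struct word_count *table, size_t cap)
{
    size_t used = 0, i = 0;

    while (i < n) {
        size_t start, len, k;

        while (i < n && is_sep(buf[i]))     /* skip separators */
            i++;
        if (i == n)
            break;
        start = i;
        while (i < n && !is_sep(buf[i]))    /* scan to the end of the word */
            i++;
        len = i - start;

        /* Two words are equal when their byte sequences are identical. */
        for (k = 0; k < used; k++) {
            if (table[k].len == len &&
                memcmp(table[k].start, buf + start, len) == 0) {
                table[k].count++;
                break;
            }
        }
        if (k == used) {                    /* first time this word is seen */
            if (used == cap)
                return (size_t)-1;
            table[used].start = buf + start;
            table[used].len = len;
            table[used].count = 1;
            used++;
        }
    }
    return used;
}
```

The linear search keeps the sketch short; for a large file a hash table or a sorted array would be a better fit. If the separator is something else, such as the full-width space U+3000, `is_sep` would have to be replaced by a test on the decoded character.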

    gg
