Thread: ISAPI unicode filenames

  1. #1
    Registered User
    Join Date
    Nov 2009
    Posts
    8

    Hi good people! I have a Unicode problem: I have created an ISAPI DLL in C++ that takes a filename and then sends that file to the requesting client. (The DLL is executed from a browser.)

    The problem is that when I send in a Japanese filename (where each character needs two bytes), the application believes the file doesn't exist.

    I'm trying to get a file named "日本語.jpg".
    This is turned into ANSI characters in the application: "æ—¥æœ¬èªž.jpg".

    Then when I try to open the file:

    Code:
    HANDLE fileHandle = CreateFile(
      filePath,  //the file path, char *filePath
      GENERIC_READ,
      FILE_SHARE_READ,
      NULL,
      OPEN_EXISTING,
      FILE_ATTRIBUTE_NORMAL,
      NULL
    );
    ...it doesn't find the file, since it checks for the ANSI version of the Unicode file name. If I rename the file to "æ—¥æœ¬èªž.jpg" and then try to get "日本語.jpg", I can get the file (but that means I have to rename the files to weird symbols).

    The question is:

    Can I make it so that the CreateFile function understands that "filePath" is Unicode and not ANSI? Or is there some other way to get a HANDLE to the correct file?

    Desperate for help! I just discovered this, and if there is no way of fixing it, it will mean I have wasted weeks coding for a broken system!

  2. #2
    Registered User
    Join Date
    Oct 2006
    Posts
    3,445
    try casting your char* to a wchar_t* and passing it to the CreateFileW() function, which is the unicode version of CreateFile(). the other possibility would be to set the option that makes your project use unicode as its default character encoding, instead of ansi/ascii. then CreateFileW would get called automatically because of the macro definitions that happen behind the scenes in a unicode project.

  3. #3
    Registered User Codeplug's Avatar
    Join Date
    Mar 2003
    Posts
    4,981
    >> try casting your char* to a wchar_t* ...
    I believe you meant "converting".

    Casting from one to the other should never be done. If the codepage of the char* string can't represent the letters in the filename, then information is already lost. On Windows, it's best to handle the filePath as Unicode (WCHAR or wchar_t string) from the beginning.

    gg

  4. #4
    Registered User
    Join Date
    Oct 2006
    Posts
    3,445
    Quote Originally Posted by Codeplug View Post
    >> try casting your char* to a wchar_t* ...
    I believe you meant "converting".

    Casting from one to the other should never be done. If the codepage of the char* string can't represent the letters in the filename, then information is already lost. On Windows, it's best to handle the filePath as Unicode (WCHAR or wchar_t string) from the beginning.

    gg
    while I would normally agree with you, it's clear in this case that the filename stored in the OP's char array contains the wchar_t data that he needs to pass to CreateFileW(). the best solution to this is just to always handle filenames as unicode, since you know that you may, at any time, come across a filename containing unicode characters.

  5. #5
    Registered User Codeplug's Avatar
    Join Date
    Mar 2003
    Posts
    4,981
    >> it's clear in this case that the filename stored in the OP's char array contains the wchar_t data that he needs
    It's clear to me that it isn't... You can see characters after the '.'. Even if there was wchar_t data in his char array, that only means that the rule has already been broken - don't cast one to the other.

    After converting "日本語" to the two most likely multi-byte character sets, UTF8 and codepage 932, I found that the UTF8 bytes correspond to the codepage 1252 glyphs: "æ—¥æœ¬èªž".
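
    The byte-level view behind this can be sketched roughly as follows. The helper name hexBytes and the constant kNihongoUtf8 are made up for illustration, and the literal spells out the UTF8 encoding of "日本語" with escapes so the sketch does not depend on the source file's own encoding. Read one byte at a time through a single-byte codepage, these nine bytes come out as nine unrelated glyphs.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: render each byte of a string as two hex digits.
std::string hexBytes(const std::string& bytes) {
    std::string out;
    char buf[4];
    for (unsigned char b : bytes) {
        std::snprintf(buf, sizeof buf, "%02X", static_cast<unsigned>(b));
        if (!out.empty()) out += ' ';
        out += buf;
    }
    return out;
}

// UTF-8 encoding of U+65E5 U+672C U+8A9E, three bytes per character.
const std::string kNihongoUtf8 = "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E";
```

    Calling hexBytes(kNihongoUtf8) gives "E6 97 A5 E6 9C AC E8 AA 9E": three characters, nine bytes.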

    So the OP should take his UTF8 pathname, convert it to UTF16LE via MultiByteToWideChar, then call CreateFileW.

    Sample C++ code for converting: wcsrtombs_s question
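
    For readers without access to that thread, here is a minimal sketch of the same idea. It is not the linked sample. The helper names utf8ToWide and openUtf8PathForRead are hypothetical. The Windows branch uses the usual two-call MultiByteToWideChar pattern (size query, then conversion); the other branch is only a loose stand-in decoder so the sketch also builds off Windows, and it assumes a 32-bit wchar_t there.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: UTF-8 bytes in, wide string out.
std::wstring utf8ToWide(const std::string& utf8) {
    if (utf8.empty()) return std::wstring();
#ifdef _WIN32
    const int srcLen = static_cast<int>(utf8.size());
    // First call: ask how many UTF-16 code units the result needs.
    const int n = MultiByteToWideChar(CP_UTF8, 0, utf8.data(), srcLen, nullptr, 0);
    if (n <= 0) throw std::runtime_error("UTF-8 conversion failed");
    std::wstring wide(static_cast<std::size_t>(n), L'\0');
    // Second call: convert into the correctly sized buffer.
    if (MultiByteToWideChar(CP_UTF8, 0, utf8.data(), srcLen, &wide[0], n) != n)
        throw std::runtime_error("UTF-8 conversion failed");
    return wide;
#else
    // Loose stand-in decoder (no overlong or surrogate checks).
    std::wstring wide;
    for (std::size_t i = 0; i < utf8.size();) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        const std::size_t len = lead < 0x80 ? 1 : (lead >> 5) == 0x06 ? 2
                              : (lead >> 4) == 0x0E ? 3 : (lead >> 3) == 0x1E ? 4 : 0;
        if (len == 0 || i + len > utf8.size())
            throw std::runtime_error("malformed UTF-8");
        // Keep only the payload bits of the lead byte.
        unsigned long cp = (len == 1) ? lead : (lead & (0xFFu >> (len + 1)));
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(utf8[i + k]);
            if ((c & 0xC0) != 0x80) throw std::runtime_error("malformed UTF-8");
            cp = (cp << 6) | (c & 0x3Fu);
        }
        wide += static_cast<wchar_t>(cp);
        i += len;
    }
    return wide;
#endif
}

#ifdef _WIN32
// Open an existing file for reading from a UTF-8 path via CreateFileW.
HANDLE openUtf8PathForRead(const std::string& utf8Path) {
    return CreateFileW(utf8ToWide(utf8Path).c_str(), GENERIC_READ, FILE_SHARE_READ,
                       nullptr, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
}
#endif
```

    The point of the design is that the char string is converted, never cast, and the "A" function is not involved at all.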

    gg

  6. #6
    Registered User
    Join Date
    Nov 2009
    Posts
    8

    Got it working, thanks!

    What I'm doing is this right before I use CreateFileW:

    Code:
    wchar_t dst[1000];
    MultiByteToWideChar(CP_UTF8, 0, filePath, 1000, dst, 1000);
    "filePath" is "char *filePath" in the function, which was declared as "char filePath[1000]" outside the function. I assume "wchar_t dst" doesn't need more than 1000 in length. Or does it need double (or half)?

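    On the buffer-size question, a rough sketch: a UTF8 string of N bytes never needs more than N UTF16 code units, so a 1000-element wchar_t buffer is enough for a 1000-byte input. The helper name utf16UnitsNeeded is made up, and the function assumes well-formed UTF8. Separately, a positive source length tells MultiByteToWideChar to convert exactly that many bytes, including whatever follows the terminator, whereas -1 makes it stop at the terminating NUL.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: count the UTF-16 code units needed for a
// well-formed UTF-8 string (terminating NUL not included).
std::size_t utf16UnitsNeeded(const std::string& utf8) {
    std::size_t units = 0;
    for (unsigned char b : utf8) {
        if ((b & 0xC0) == 0x80) continue;  // continuation byte: belongs to the previous character
        units += (b >= 0xF0) ? 2 : 1;      // 4-byte sequences become a surrogate pair
    }
    return units;
}
```

    For the UTF8 bytes of "日本語.jpg" (13 bytes) this returns 7, so the wide buffer needs fewer elements than the char buffer has bytes, not more.
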
    First I try to open the file with CreateFileA; if that fails, I try to open the file with CreateFileW. If I didn't do it like that, I couldn't open files named like so: ".jpg".

    I don't really get why CreateFileW works with Japanese characters and normal characters (a-z) but not with "" or "" for example. CreateFileW works when the file name is "日本語.jpg". I guess the DLL is being sent characters encoded differently?

    Well, it works the way it is now, by using both CreateFileA and CreateFileW. I always thought it was strange that Windows uses UTF16 and not UTF8, which I think looks superior.

    Thanks again gentlemen!

    Edit: Sorry for my late reply, I have been busy and haven't checked for replies until just now.

  7. #7
    Registered User
    Join Date
    Nov 2009
    Posts
    8
    This is odd: some file names seem to work and others don't. Also, some that work on Windows XP don't work at all on Windows Vista.

    Here are a few file names that I can't get a handle for:

    ( ゚ 3゚) ( -`) (☞゚∀゚)☞ ( ゚ Д゚).jpg
    (`.FILENAME.).jpg
    ( ゚Д゚)ァハハ八八ノヽノヽノヽノ \ \ \.jpg
    -`~☆★BLAH★☆~`-.jpg
    █▄ ██ █▄.jpg
    ಥ﹏ಥ.jpg

    Any ideas? Those are strange file names I know but it would be cool if they worked as well.

    In addition, on Windows XP I can access "日本語.jpg" (prints as "æ—¥æœ¬èªž.jpg"), but on Windows Vista I can't access "日本語.jpg" (prints as "???.jpg"). I don't know what is going on.

  8. #8
    Registered User Codeplug's Avatar
    Join Date
    Mar 2003
    Posts
    4,981
    >> First I try to open the file with CreateFileA, if that fails I try to open the file with CreateFileW
    CreateFileA simply calls CreateFileW after converting the given filename from the ACP to UTF16.

    So either the text is encoded in UTF8 or some codepage. If you don't know, then all you can do is guess.

    Ideally you would know what you have, convert it explicitly to UTF16, and call CreateFileW.

    gg

  9. #9
    Registered User
    Join Date
    Nov 2009
    Posts
    8
    I thought "CreateFile" was an alias for "CreateFileA"? Is there a difference between those two?

    And you are saying that "CreateFileA" converts to UTF16 automatically, so it is the same as "CreateFileW"? Is there any need for "CreateFileW" if "CreateFileA" already converts to UTF16?

    What alternatives are there to "CP_UTF8" that I could try? Is there some way to find out what I should use by analyzing "char *filePath"?

    Really thankful if you help me out.

  10. #10
    Registered User Codeplug's Avatar
    Join Date
    Mar 2003
    Posts
    4,981
    "CreateFile" is just a macro that resolves to either CreateFileA or CreateFileW, depending on if UNICODE is defined or not. The 'A' version assumes that the given string is encoded using the codepage returned by GetACP().

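    The macro mechanism can be imitated in a few lines. This is only a sketch of the pattern: nameLengthA, nameLengthW, nameLength and TEXT_LIT are invented names standing in for an A/W pair such as CreateFileA / CreateFileW and for the text-literal macro that goes with it.

```cpp
#include <cstddef>
#include <cstring>
#include <cwchar>

// Hypothetical stand-ins for an "A" and a "W" version of the same function.
std::size_t nameLengthA(const char* s)    { return std::strlen(s); }
std::size_t nameLengthW(const wchar_t* s) { return std::wcslen(s); }

// One generic name, resolved at compile time by the UNICODE define.
#ifdef UNICODE
#define nameLength nameLengthW
#define TEXT_LIT(s) L##s
#else
#define nameLength nameLengthA
#define TEXT_LIT(s) s
#endif
```

    nameLength(TEXT_LIT("a.jpg")) compiles either way, but which function runs depends on the project setting. Calling the "W" function by name removes that dependency.
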
    Trying to figure out how the strings are encoded will never be reliable. The best solution is to fix whatever is giving you the strings to always use UTF8/16.
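
    To illustrate why this is guesswork, here is a sketch of the usual heuristic: check whether the bytes are structurally valid UTF8. The function name looksLikeUtf8 is made up, and the check is deliberately loose (it does not reject overlong forms or surrogates). Plain ASCII passes whatever the real encoding is, and a codepage string can be valid UTF8 by coincidence, so a "true" result is a hint and nothing more.

```cpp
#include <cstddef>
#include <string>

// Hypothetical heuristic: true when every byte sequence has a valid
// UTF-8 shape (lead byte followed by the right number of continuations).
bool looksLikeUtf8(const std::string& s) {
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t len;
        if (lead < 0x80) len = 1;
        else if (lead >= 0xC2 && lead <= 0xDF) len = 2;
        else if (lead >= 0xE0 && lead <= 0xEF) len = 3;
        else if (lead >= 0xF0 && lead <= 0xF4) len = 4;
        else return false;                     // stray continuation or invalid lead byte
        if (i + len > s.size()) return false;  // sequence cut off at the end
        for (std::size_t k = 1; k < len; ++k) {
            if ((static_cast<unsigned char>(s[i + k]) & 0xC0) != 0x80) return false;
        }
        i += len;
    }
    return true;
}
```

    The UTF8 bytes of "日本語.jpg" pass this check, while the codepage 932 bytes of the same name (93 FA 96 7B 8C EA) fail on the very first byte.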

    gg

  11. #11
    Registered User
    Join Date
    Nov 2009
    Posts
    8

    I have located the problem. When I access the site...
    http://localhost:85/image.x?日本語.jpg
    ...on my Vista machine, running IIS 7.0, an HTML page containing...
    <img src="???.jpg">
    ...is generated. No image is found.

    However when I access the site...
    http://192.168.0.200:1337/image.x?日本語.jpg
    ...on my XP machine, running IIS 5.1, an HTML page containing...
    <img src="日本語.jpg">
    ...is generated. Image is found.

    If I access...
    http://localhost:85/日本語.jpg
    ...directly on my Vista machine the image is found.

    My ISAPI DLL is executed both when I access "image.x?<query>" and when I access the image directly (the only difference is that the DLL generates and returns an HTML page in the first case and returns an image in the second case).

    Interestingly, if I access...
    http://192.168.0.200:1337/日本語.jpg
    ...on my XP machine the image will be found, while when I access...
    http://localhost:85/日本語.jpg
    ...on my Vista machine the image will not be found.

    Since the DLL gets the requested URL from the IIS server, my conclusion is that IIS 7.0 hands over the string encoded in some weird way... the million dollar question is: how can one find out what encoding is used?

    You say that the best solution would be to make sure IIS always uses UTF8 when passing me the URL. I agree, but I have no idea how to do that. Any ideas out there? Is there a setting in IIS I should know about? This feels like magic to me.
    Last edited by Mackan2009; 11-26-2009 at 05:46 PM. Reason: removed clickable urls

  12. #12
    Registered User
    Join Date
    Nov 2009
    Posts
    8

    This is driving me a little mad, and it has come to a point where I just want to work around the problem instead of trying to fix it in C code, which seems impossible.

    I've been reading
    Getting the correct Unicode path within an ISAPI filter — kirit.com
    which discusses ISAPI and Unicode. However, even when I use UNICODE_URL as recommended in the article, I cannot reach 日本語.jpg on IIS 7.0. When using UNICODE_URL the URL should apparently be in UTF16. I tried:

    Code:
    char pathToFile[1000];
    DWORD pathToFileSize = sizeof(pathToFile);
    pECB->GetServerVariable(pECB->ConnID, "UNICODE_URL",
    	pathToFile, &pathToFileSize);
    
    HANDLE fileHandle;
    fileHandle = CreateFileA(pathToFile, ~etc...);
    if (fileHandle==INVALID_HANDLE_VALUE) {
    	wchar_t unicodeFilePath[1000];
    	MultiByteToWideChar(CP_UTF8, 0, pathToFile, 1000, unicodeFilePath, 1000);
    	fileHandle = CreateFileW(unicodeFilePath, ~etc...);
    }
    But it didn't work. CP_UTF16 doesn't exist either; I tried passing the number 1200 instead, as this table...
    Constant Field Values (POI API Documentation)
    ...shows. I assume it's correct to do so, but it's probably not, sigh. Whatever.

    It also takes ages to test this stuff out: I make changes on Win XP, where all my project files are, and then I have to transfer and set everything up on Vista just to see that it made no change. It probably won't work on Server 2003 either, which is where this project will run later.

    So, what I'm thinking now is that maybe I can work around these ancient C problems with Unicode with a little JavaScript, nay?

    On the HTML page, on Vista where the image URL turns up as "???.jpg", if I run this:
    <script>alert(document.location.href);</script>
    I see the Japanese letters. So if I can somehow turn these Japanese letters from "日本語" to "æ—¥æœ¬èªž" in JavaScript, I can stick them into the image URL, replacing "???" and thus showing the image. With luck the end user won't even know the switch has been made. I know it's a cheesy solution, but I just want it to work at this point. There won't be that many Unicode file names anyway.

    So,

    anyone here know how "日本語" could be transformed to "æ—¥æœ¬èªž" using JavaScript?

    I know this is a C board, but maybe I'll get lucky. Of course I'm still open to a good solution to the real problem as well (making it possible to find the file in the ISAPI extension).

  13. #13
    Registered User Codeplug's Avatar
    Join Date
    Mar 2003
    Posts
    4,981
    Unicode characters on Windows are stored in 16bit variables, wchar_t or WCHAR, and are encoded as UTF16LE. All the UTFs are encodings of Unicode characters.

    A "codepage" is just a table that assigns numbers to glyphs. Codepage strings in Windows are stored in 8bit variables, char or CHAR. There are "single-byte" codepages, like 1250, and there are "multi-byte" codepages, like 932 (notice that 0x81 just redirects to another 8bit table).

    You can think of Unicode as a single "codepage" for all the world's languages.

    MultiByteToWideChar == "Codepage to UTF16"
    WideCharToMultiByte == "UTF16 to Codepage"

    CP_UTF8 isn't a real codepage. It's just a value that was created for the sole purpose of converting between UTF8 and UTF16, using the above 2 functions.

    When you get a variable that begins with "UNICODE_", you should pass in storage for a wchar_t/WCHAR string. You can then call CreateFileW directly.
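
    A sketch of what "pass in storage for a wchar_t string" can look like. Everything here is an assumption made for illustration: readWideVariable is a made-up helper, and the getter parameter stands for any callable shaped like GetServerVariable with the connection and variable name already bound, which keeps the sketch buildable outside an ISAPI project. The thing to notice is that the size is given in bytes while the storage is wchar_t.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper. Getter is any callable of the form
//   bool(void* buffer, unsigned long* sizeInBytes)
template <typename Getter>
bool readWideVariable(Getter get, std::wstring& out, std::size_t maxChars = 1000) {
    // Zero-filled wide buffer with one spare element, so it always ends in a NUL.
    std::vector<wchar_t> buf(maxChars + 1, L'\0');
    // The size is passed in BYTES, not in characters.
    unsigned long bytes = static_cast<unsigned long>(maxChars * sizeof(wchar_t));
    if (!get(static_cast<void*>(buf.data()), &bytes)) return false;
    out.assign(buf.data());
    return true;
}

// Possible wiring inside an ISAPI extension (Windows only, not compiled here):
//   char name[] = "UNICODE_URL";
//   std::wstring url;
//   readWideVariable([&](void* p, unsigned long* n) {
//       return pECB->GetServerVariable(pECB->ConnID, name, p, n) != FALSE;
//   }, url);
```

    The resulting wide string needs no MultiByteToWideChar step and can go to a "W" function as it is.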

    See the sample code here: http://msdn.microsoft.com/en-us/library/ms525335.aspx

    On your test machines, make sure you are using a filesystem that supports Unicode: http://www.ntfs.com/ntfs_vs_fat.htm

    gg

  14. #14
    Registered User
    Join Date
    Nov 2009
    Posts
    8

    Okay I think I get it now. Thank you very much!

    I have come to a decision, though: I won't support Unicode characters. I began to make the necessary changes, but the DLL works as it is now with ANSI characters, and changing it would take too long to be worth my time. A lot of char strings would need to be merged with the Unicode wchar_t ones... If I had made the app for Unicode from the start, it might have been a different story.

    I also spent a lot of time yesterday figuring out how to convert UTF8 strings to ANSI strings, and got it working, so now the images can be downloaded (though the file names might not look too good). But for every 1 weird-looking character there will be 1000000 that look okay, so it doesn't matter that much, as long as the image is downloadable.

    Thanks a lot again, I have learned much. Perhaps I will make the necessary changes to add Unicode support to the DLL in the future.

  15. #15
    Registered User
    Join Date
    Nov 2009
    Posts
    8

    Follow up: It's working well now. On IIS5.1, that is. Surprise, surprise: there is some security setting in IIS7.0 that prevents certain weird URLs from being reachable.

    For example, reaching "ಥ﹏ಥ.jpg" can be done through this URL:
    http://localhost:85/ಥ﹏ಥ.jpg

    This works on IIS5.1, however on IIS7.0 it gives the following error:
    "Bad Request - Invalid URL<hr>HTTP Error 400. The request URL is invalid."

    Christ. The culprit in this is the "" character between and , maybe it shows up as a whitespace character on these forums but in the URL it doesn't show up at all, I guess it's a character representing DEL (ASCII #127) or something.

    Thing is it should work, if only IIS7.0 didn't have some sort of security thing enabled which I can't bloody turn off (tried creating some keys in the registry recommended by sites but it doesn't work). Big question is if it will work on IIS6.0 later where the DLL will live.

    I wish a byte was big enough to fit all unicode characters, haha.
