bad characters in excel saved CSV file

This is a discussion on bad characters in excel saved CSV file within the C++ Programming forums, part of the General Programming Boards category; I have a problem with a set of data that is causing getline() to collect incomplete strings. The data is ...

  1. #1
    Registered User rogster001's Avatar
    Join Date
    Aug 2006
    Location
    Liverpool UK
    Posts
    1,425

    bad characters in excel saved CSV file

    I have a problem with a set of data that is causing getline() to collect incomplete strings.

    The data is exported as csv from a so called 'cloud' application, blah bloody blah, and it is formatted fine and runs fine if the data file is left alone after export, this data export is a weekly task, so if anyone else exporting it in my abscence decides to view it and then save as csv using excel, then the format it saves in will mess up the program output.
    to cover this eventuality how can this be caught programatically? It is like the linefeed is being incorrectly interpreted.

    I checked in notepad and no special characters are in the data, i used Word and showed formatting marks, nothing in there except end of line and raised dot symbol for spaces.

    The line in the data is
    Test Test, emailaddress@provider.com, sitename, status
    And getline is reading :
    t, emailaddress@provider.com, sitename, status
    the remaining lines are incomplete in a similar way, except some of them pick up the whole second name in the first field and discard the first, like:
    line in data is:
    name1 name2, emailaddress@provider.com, sitename, status
    getline collects:
    name2, emailaddress@provider.com, sitename, status
    however it is most strange as if i use a second call to getline underneath and a temporary string variable the line is collected fine...

    Prior to saving in excel the data exports with strings surrounded by quotes, i saw these are removed after saving in excel, but i do not really think this is the problem as other source data the program uses uses files exported from a different database application that does not add the quotes in the first place anyway.
    Last edited by rogster001; 05-27-2011 at 03:16 AM.
    Thought for the day:
    "Are you sure your sanity chip is fully screwed in sir?" (Kryten)
    FLTK: "The most fun you can have with your clothes on."

    Stroustrup:
    "If I had thought of it and had some marketing sense every computer and just about any gadget would have had a little 'C++ Inside' sticker on it'"

  2. #2
    and the hat of wrongness Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    32,558
    Well how are you using getline() ?
    What does "mess up" actually mean?
    Do you have "before" and "after" CSV files you could attach to your post (or at least point us to)?
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.
    I support http://www.ukip.org/ as the first necessary step to a free Europe.

  3. #3
    Registered User rogster001's Avatar
    Join Date
    Aug 2006
    Location
    Liverpool UK
    Posts
    1,425
    the data is customer info so i will have to see if i can amend a couple of the records, save in notepad to preserve the 'good' version, then get the 'bad ' version using excel. Will add an edit in a moment... The thing is i tested in a small seperate program and found no problem... I also have just seen that perhaps i should explicitly declare the <string> include, as well as namespace std. they are included in the other header listbuilder.h, but maybe that will fix it...oh forget that rubbish, they were already written in, was looking at wrong file

    By 'mess up output' i mean the strings will later on be evaluated incorrectly as they are incomplete, so the application will match several hundred records it should ignore

    getline i am using like this:

    Code:
    #include <string>
    #include "UI/fluidUI.h"
    #include "ListBuilder.h"
    
    using namespace std;
    
    void ListBuilder::GetData()
    {
        bool all_over = false; //prevent results window showing unless all operations are complete for current work.
    
        if(is_startup)
        {
            if(bookedCust)
                salesFdata.open(sfPath.c_str(),ios_base::in); 
            else
                salesFdata.open(sfDraftPath.c_str(),ios_base::in);
    
            if(!CheckOutputFilesOpen())
            ShowError(FILE_FAIL);
        }
    
        if(salesFdata.is_open())
        {
            if(is_startup)
            {
                getline(salesFdata, sfPrevLine); //store the column names row first
                GetStartValues();
            }
            while(sfThisLine.length())
            {
                while((running) && (!done))
                {
                    if(running)
                    {
                        if(!was_paused)  //sfThisLine below is just my class std::string member
                            getline(salesFdata, sfThisLine); //the next line in file for comparison in function... this is where i am reading the problem
                        if(bookedCust)
                            all_over = GetBookedData();  //do the matching work and check if finished
                        else
                            all_over = GetDraftData();
    
                       was_paused = false;
                    }
                }
                if(!running || done)
                break;
            }
            if(done)
            {
                CheckFileClose();
                mainWin->deactivate();
    
                if(filterNew)
                OutputNewCustomers();
                if(all_over)
                ShowResultsMessage();
            }
        }
        else
        {
            if(bookedCust)
                errString = sfPath;
            else
                errString = sfDraftPath;
    
            startBtn->copy_label("Start"); 
            clickState = true;
            ShowError(FILE_FAIL);
        }
    
    }
    I have added a text copy of a couple of lines as they appear after export, i cannot upload the same lines saved as CSV as its not in the filetypes list, so i saved as txt to show how it looks after saving in excel
    Attached Files Attached Files
    Last edited by rogster001; 05-27-2011 at 04:06 AM. Reason: ADD Files
    Thought for the day:
    "Are you sure your sanity chip is fully screwed in sir?" (Kryten)
    FLTK: "The most fun you can have with your clothes on."

    Stroustrup:
    "If I had thought of it and had some marketing sense every computer and just about any gadget would have had a little 'C++ Inside' sticker on it'"

  4. #4
    and the Hat of Guessing tabstop's Avatar
    Join Date
    Nov 2007
    Posts
    14,185
    I have no idea what system you're on, but: the two files as you uploaded them would appear to have different end-of-line marks (the after excel file looked "normal" on this Windows machine, while the as exported file was all on one line with funny symbols between). Does running something like dos2unix on the excel version change things?

  5. #5
    Registered User rogster001's Avatar
    Join Date
    Aug 2006
    Location
    Liverpool UK
    Posts
    1,425
    When you say looked 'normal' which program did you use when viewing the symbols ? was the funny symbol like a little square? as i believe this is known to microsoft word at least as an end of line or carriage return symbol
    however when i open them in word the before version has the speech marks as expected, the show formatting marks button reveals the 'normal' backwards p shaped carriage return symbol.
    the after version is same except there are no speech marks.

    I am using windows 7, office 2003, compiling with gcc, have tested on a number of win7 machines and also an old windows 2000, same result.

    dos2unix does not seem to be an option for me

    I have added an instruction to the help for the program saying do not interact with the source data other than to download it and overwrite the old files. Thats best i can do for now!
    Last edited by rogster001; 05-27-2011 at 06:56 AM.
    Thought for the day:
    "Are you sure your sanity chip is fully screwed in sir?" (Kryten)
    FLTK: "The most fun you can have with your clothes on."

    Stroustrup:
    "If I had thought of it and had some marketing sense every computer and just about any gadget would have had a little 'C++ Inside' sticker on it'"

  6. #6
    a_capitalist_story
    Join Date
    Dec 2007
    Posts
    2,651
    as exported:

    Code:
    00000000: 2244 6544 7570 654b 6579 3a20 4163 636f  "DeDupeKey: Acco
    00000010: 756e 7420 4e61 6d65 222c 2244 6544 7570  unt Name","DeDup
    00000020: 654b 6579 3a20 4465 4475 7065 4b65 7922  eKey: DeDupeKey"
    00000030: 2c22 4f72 6465 7220 5374 6174 7573 222c  ,"Order Status",
    00000040: 2253 6974 6522 0a22 5445 5354 2054 4553  "Site"."TEST TES
    00000050: 5422 2c22 4a4c 452e 4552 5249 4d40 4547  T","JLE.ERRIM@EG
    00000060: 414e 2e43 4f2e 554b 222c 2244 7261 6674  AN.CO.UK","Draft
    00000070: 222c 224d 5722 0d0a 0a22 4144 414d 2050  ","MW"..."ADAM P
    00000080: 4849 4c49 5022 2c22 4144 414d 2d50 3140  HILIP","ADAM-P1@
    00000090: 534b 594f 4e45 2e43 4f4d 222c 2241 6363  SKYONE.COM","Acc
    000000a0: 6570 7465 6422 2c22 4d57 220d 0a0a 2257  epted","MW"..."W
    000000b0: 494c 4c20 414c 414e 222c 2241 4c41 4e57  ILL ALAN","ALANW
    000000c0: 494c 4c40 4e45 542e 434f 4d22 2c22 4163  ILL@NET.COM","Ac
    000000d0: 6365 7074 6564 222c 224d 5722 0a20       cepted","MW".
    after excel save:
    Code:
    00000000: 4465 4475 7065 4b65 793a 2041 6363 6f75  DeDupeKey: Accou
    00000010: 6e74 204e 616d 652c 4465 4475 7065 4b65  nt Name,DeDupeKe
    00000020: 793a 2044 6544 7570 654b 6579 2c4f 7264  y: DeDupeKey,Ord
    00000030: 6572 2053 7461 7475 732c 5369 7465 0d0a  er Status,Site..
    00000040: 5445 5354 2054 4553 542c 4a4c 452e 4552  TEST TEST,JLE.ER
    00000050: 5249 4d40 4547 414e 2e43 4f2e 554b 2c44  RIM@EGAN.CO.UK,D
    00000060: 7261 6674 2c4d 570d 0a41 4441 4d20 5048  raft,MW..ADAM PH
    00000070: 494c 4950 2c41 4441 4d2d 5031 4053 4b59  ILIP,ADAM-P1@SKY
    00000080: 4f4e 452e 434f 4d2c 4163 6365 7074 6564  ONE.COM,Accepted
    00000090: 2c4d 570d 0a57 494c 4c20 414c 414e 2c41  ,MW..WILL ALAN,A
    000000a0: 4c41 4e57 494c 4c40 4254 494e 5445 524e  LANWILL@BTINTERN
    000000b0: 4554 2e43 4f4d 2c41 6363 6570 7465 642c  ET.COM,Accepted,
    000000c0: 4d57 0d0a                                MW..
    So I think your problem is right here: 0d 0a0a
    Last edited by rags_to_riches; 05-27-2011 at 06:57 AM.

  7. #7
    and the Hat of Guessing tabstop's Avatar
    Join Date
    Nov 2007
    Posts
    14,185
    Quote Originally Posted by rogster001 View Post
    When you say looked 'normal' which program did you use when viewing the symbols ? was the funny symbol like a little square? as i believe this is known to microsoft word at least as an end of line or carriage return symbol
    however when i open them in word the before version has the speech marks as expected, the show formatting marks button reveals the 'normal' backwards p shaped carriage return symbol.
    the after version is same except there are no speech marks.

    I am using windows 7, office 2003, compiling with gcc, have tested on a number of win7 machines and also an old windows 2000, same result.

    dos2unix does not seem to be an option for me

    I have added an instruction to the help for the program saying do not interact with the source data other than to download it and overwrite the old files. Thats best i can do for now!
    Just for completeness, I had opened it up in Notepad. The squares do stand for end-of-line-like symbols; but if you have the proper Windows line endings, you don't see any squares, just a new line, which is the point.

  8. #8
    Registered User rogster001's Avatar
    Join Date
    Aug 2006
    Location
    Liverpool UK
    Posts
    1,425
    So I think your problem is right here: 0d 0a0a
    Hmm i see this yes, no idea what to do about that though, even in the pre excel those dots should not be there, whoever created the csv export function on our cloudy application has added those in, typos methinks, i will have to see if that can be updated, still not clear to me why that breaks things so much though... jsut because it ruins the delimiting?? But that can't be it as the fileds are still recognised by excel.
    Thought for the day:
    "Are you sure your sanity chip is fully screwed in sir?" (Kryten)
    FLTK: "The most fun you can have with your clothes on."

    Stroustrup:
    "If I had thought of it and had some marketing sense every computer and just about any gadget would have had a little 'C++ Inside' sticker on it'"

  9. #9
    and the Hat of Guessing tabstop's Avatar
    Join Date
    Nov 2007
    Posts
    14,185
    Quote Originally Posted by rogster001 View Post
    Hmm i see this yes, no idea what to do about that though, even in the pre excel those dots should not be there, whoever created the csv export function on our cloudy application has added those in, typos methinks, i will have to see if that can be updated, still not clear to me why that breaks things so much though... jsut because it ruins the delimiting?? But that can't be it as the fileds are still recognised by excel.
    It's a little bit worse than that, as that's the working version! Would it be possible, for the two data files provided, to see what your program does to them specifically?

  10. #10
    Registered User rogster001's Avatar
    Join Date
    Aug 2006
    Location
    Liverpool UK
    Posts
    1,425
    I think the best i can offer is a description as the complete application is several MB now.

    The aim of the program feature in question is to check each customer history and find any that have only draft status against them. If an occurence of 'accepted' or 'completed' is found in their history they will not be included in the output.
    The file is opened and gets the next instance of current record 'sfThisLine'
    the program is running 'draft' mode so you see the if statement diverting flow (incidentally the same problem happens for 'booked' anyhow)
    A vector temporarialy stores each record as a 'history' for the current customer until name changes
    Once the customer name changes, another function returning a bool is called to check this history,
    if an instance of status that is not 'draft' is found in the history the customer is immediately ignored and the code lumbers on, back out into the loop that you saw in the posted code

    the string 'sfPrevLine' stores the new customer name, whilst sfThisLine' goes ahead to collect each instance of that customer in the file

    name field and email field are used as two unique identifiers and these portions of strings sfPrevLine and sfThisLine are compared in a function to see if customer has changed

    SO to answer more clearly, the algorithm will fail if the line is not being read from the correct starting point each time, as when it is out of step of broken names mean that customer histories are not examined properly as the names will not match up.
    Last edited by rogster001; 05-27-2011 at 09:35 AM.
    Thought for the day:
    "Are you sure your sanity chip is fully screwed in sir?" (Kryten)
    FLTK: "The most fun you can have with your clothes on."

    Stroustrup:
    "If I had thought of it and had some marketing sense every computer and just about any gadget would have had a little 'C++ Inside' sticker on it'"

  11. #11
    and the Hat of Guessing tabstop's Avatar
    Join Date
    Nov 2007
    Posts
    14,185
    Right, I just meant, for the small data files there, what the chunk of code generates (i.e. what the successive calls to getline get), not worrying about the processing that happens later.

  12. #12
    Registered User rogster001's Avatar
    Join Date
    Aug 2006
    Location
    Liverpool UK
    Posts
    1,425
    haha ok rite, they just get the whole line, that you see in the files, ie first line collected by sfThisLine is as in notepad:

    Code:
    "DeDupeKey: Account Name","DeDupeKey: DeDupeKey","Order Status","Site"
    except of course in the debugger or stdout this line you would see the escape characters \" too
    Thought for the day:
    "Are you sure your sanity chip is fully screwed in sir?" (Kryten)
    FLTK: "The most fun you can have with your clothes on."

    Stroustrup:
    "If I had thought of it and had some marketing sense every computer and just about any gadget would have had a little 'C++ Inside' sticker on it'"

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. Scoreboard saved in .txt file?
    By Newklear in forum C Programming
    Replies: 14
    Last Post: 03-21-2011, 08:06 PM
  2. weird symbol saved into the file
    By gknight in forum C Programming
    Replies: 2
    Last Post: 04-01-2010, 09:35 PM
  3. Location of Saved File
    By loopshot in forum C++ Programming
    Replies: 13
    Last Post: 01-22-2006, 10:26 PM
  4. Replies: 3
    Last Post: 11-08-2005, 06:25 AM
  5. reading saved file
    By firefly in forum C++ Programming
    Replies: 1
    Last Post: 07-09-2005, 01:24 AM

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21