Thread: Reading 2 csv files and combining their content into an output

  1. #1
    Registered User
    Join Date
    Aug 2012
    Posts
    78

    Reading 2 csv files and combining their content into an output

    Hello there!

    I am new to C/C++ and I'm trying to solve a simple problem efficiently:

    I would like to read two very data-intensive .csv files, format and combine the content into a .txt file.

    The first file input1.csv (~6 GB in the real scenario) is of the following form:
    # rows # features
    feature-value:feature-id, feature-value:feature-id,....
    feature-value:feature-id, feature-value:feature-id,....
    feature-value:feature-id, feature-value:feature-id,....
    ....
    ....
    The second file input2.csv (~70 MB in the real scenario) is of the following form:
    # rows # labels
    label-id:1.0, label-id:1.0,....
    label-id:1.0, label-id:1.0,....
    label-id:1.0, label-id:1.0,....
    ....
    ....
    The final output of the processing-function is output.txt, which shall contain both features and labels (separated by tabs), but the formatting is slightly different (the colon : is replaced by !) and a line enumeration is added:
    1\tfeature-value!feature-id, feature-value!feature-id,....\tlabel-id!1.0, label-id!1.0,....
    2\tfeature-value!feature-id, feature-value!feature-id,....\tlabel-id!1.0, label-id!1.0,....
    ....
    ....
    #rows\tfeature-value!feature-id, feature-value!feature-id,....\tlabel-id!1.0, label-id!1.0,....
    Now, I have the following questions:

    1) Does it make sense to use C++ on such huge files? I was told that Python and MATLAB would take much longer.
    2) I have been trying to parse the csv file using fread and strtok, but fread reads the commas as well, whereas I would like to write only the values, joined by the exclamation mark, to the output file.

    Output of the c++ reading/writing code (as given in the link below) performed on input2.csv:

    12,8,
    5:1.0,,
    1:1.0,4:1.0,
    2:1.0,3:1.0,
    3:1.0,8:1.0,
    4:1.0,,
    5:1.0,,
    ,,
    2:1.0,,
    3:1.0,,
    4:1.0,7:1.0,
    3:1.0,4:1.0,5:1.0
    1:1.0,5:1.0,
    Links: input1.csv, input2.csv
    I have uploaded the sample output file for an overview.

  2. #2
    C++まいる!Cをこわせ!
    Join Date
    Oct 2007
    Location
    Inside my computer
    Posts
    24,654
    You shouldn't be using fread or strtok. They are C functions.
    This can be done easily in C++.
    One way I can think of is to use std::getline to read a full line. Then you can construct a string stream from the string you read, and construct an istream_iterator from that. You can then use std::copy to copy the items out of the line into a vector, passing a back-insert iterator (std::back_inserter makes one) as the third parameter, the output iterator of std::copy.
    Then write everything back into an output file from the strings in your vector, putting whitespace and newlines as you desire.

    I could make an example, but I also want this to be a learning experience for you to read the documentation and get familiar with these features in C++.
    Quote Originally Posted by Adak View Post
    io.h certainly IS included in some modern compilers. It is no longer part of the standard for C, but it is nevertheless, included in the very latest Pelles C versions.
    Quote Originally Posted by Salem View Post
    You mean it's included as a crutch to help ancient programmers limp along without them having to relearn too much.

    Outside of your DOS world, your header file is meaningless.

  3. #3
    Registered User
    Join Date
    Aug 2012
    Posts
    78
    Quote Originally Posted by Elysia View Post
    You shouldn't be using fread or strtok. They are C functions.
    This can be done easily in C++.
    One way I can think of is to use std::getline to read a full line. Then, you can construct a string stream from the string you read, and then construct an istream_iterator from that.
    Hey Elysia! Thanks a lot for your help! I managed to get enumeration + tab separation + reading parts of each line done. However, I am facing some issues with the last entry of the line (the stopping character is wrong) due to which the entire syntax is messed up.


    Code:
    int main ()
    {
    
    
        ifstream lbl ("input1.csv");                                 // declare lbl stream
        ifstream ft ("input2.csv");                                 // declare ft stream
    
    
        // 1) Get number of data points
        string datapoints, dummy;
        getline(lbl, datapoints, ',');                                 // read very first value: # data points
        cout << "datapoints: " << string(datapoints) << endl;         // display number of datapoints
        double count = atof(datapoints.c_str());
    
        getline(lbl, dummy, '\n');                                     // read and drop second value
        getline(ft, dummy, '\n');                                    // read and drop whole line
    
    
        // 2) Start writing into output file
        string outfilename = "output.txt";
        FILE* fout = fopen(outfilename.c_str(), "w");
    
        string ftid;
        string ftvalue;
        for (int row = 1; row <= count; row++)
        {
            // enumerator
            fprintf(fout, "%d\t", row);
    
            // features
            getline(ft, ftid, ':');
            getline(ft, ftvalue, ',');
    
            cout << string(ftid) << "!" << string(ftvalue) << ",";
    
    
            // labels
    
    
    
            // new line
            cout << "\n";
            fprintf(fout, "\n");
        }
        fclose(fout);
    
    
    
              return 0;
    }

    Corresponding console output (which will be written into output.txt once it's correct):
    datapoints: 12
    1!4,
    2!5,
    3!7,
    4!4,
    5!7

    1:4,
    2!3,
    3!6,
    4!4,
    5!6

    1:4,
    2!4,
    3!7,
    4!4,
    And where is the rest?
    Can you help me out here?

  4. #4
    Registered User rogster001's Avatar
    Join Date
    Aug 2006
    Location
    Liverpool UK
    Posts
    1,472
    one way you can help yourself out is by not writing C+ code
    Thought for the day:
    "Are you sure your sanity chip is fully screwed in sir?" (Kryten)
    FLTK: "The most fun you can have with your clothes on."

    Stroustrup:
    "If I had thought of it and had some marketing sense every computer and just about any gadget would have had a little 'C++ Inside' sticker on it'"

  5. #5
    Registered User
    Join Date
    Aug 2012
    Posts
    78
    I'm trying my best to familiarize myself with a new language. Please bear with me.

    So - I tried making use of stringstream.

    Code:
        // 2) Start writing into output file
        string outfilename = "train.txt";
        FILE* fout = fopen(outfilename.c_str(), "w");
    
        string ftline;
        for (int row = 1; row <= count; row++)
        {
            // enumerator
            fprintf(fout, "%d\t", row);
    
            // features
            getline(ft, ftline, '\n');
            stringstream ss(ftline);
    
            string ftid, ftvalue;
    
            while (!ss.eof())
            {
                getline(ss, ftid, ':');
                if (ftid.empty() == false)
                {
                    cout << string(ftid) << "!";
                }
    
                getline(ss, ftvalue, ',');
                if (ftvalue.empty() == false)
                {
                    cout << string(ftvalue) << ",";
                }
            }

    The output is messed up - once again:

    datapoints: 12
    1!4,2!5,3!7,4!4,5!7
    ,1!4,2!3,3!6,4!4,5!6
    ,1!4,2!4,3!7,4!4,5!7
    ,1!2,3!5,4!7,5!4,8!3
    ,2!3,4!4,5!7,,
    !7,1!4,3!7,4!6,,
    !6,,,,,
    !2!4,8!7,,,
    !7,3!1,4!4,6!7,,
    !7,1!5,2!8,,,
    !8,1!1,2!5,,,
    !5,3!4,4!5,6!8,,!8,

  6. #6
    Registered User rogster001's Avatar
    Join Date
    Aug 2006
    Location
    Liverpool UK
    Posts
    1,472
    What exactly is the problem you are having? You are still mixing C with C++. Elysia pointed you to the docs for many useful C++ features; you should read those, even if you just browse the pages. Also, your original theme of making a .csv into a .txt is completely redundant: the data is simply data.

  7. #7
    C++ Witch laserlight's Avatar
    Join Date
    Oct 2003
    Location
    Singapore
    Posts
    28,413
    Do not mix C-style I/O with C++ I/O streams unless you have a good reason to do so. This:
    Code:
    FILE* fout = fopen(outfilename.c_str(), "w");
    // ...
    fprintf(fout, "%d\t", row);
    should be:
    Code:
    std::ofstream fout(outfilename.c_str());
    // ...
    fout << row << '\t';
    You should not use eof() to control a loop like that. I would expect something like:
    Code:
        while (getline(ss, ftid, ':'))
        {
            if (!ftid.empty())
            {
                cout << ftid << "!";
            }
    
            if (getline(ss, ftvalue, ','))
            {
                if (!ftvalue.empty())
                {
                    cout << ftvalue << ",";
                }
            }
            else
            {
                break;
            }
        }
    Quote Originally Posted by Bjarne Stroustrup (2000-10-14)
    I get maybe two dozen requests for help with some sort of programming or design problem every day. Most have more sense than to send me hundreds of lines of code. If they do, I ask them to find the smallest example that exhibits the problem and send me that. Mostly, they then find the error themselves. "Finding the smallest program that demonstrates the error" is a powerful debugging tool.
    Look up a C++ Reference and learn How To Ask Questions The Smart Way

  8. #8
    Lurking whiteflags's Avatar
    Join Date
    Apr 2006
    Location
    United States
    Posts
    9,612
    1) Does it make sense to use C++ on such huge files? I was told that Python and MATLAB would take much longer.
    I'm not completely sure I understand your file format. But consider the following python code:
    Code:
    def process_line(line):
        return '!'.join(line.split(':')).strip('\n')
    Which can be called on all the lines of both files in something like:
    Code:
    for line, line2 in zip(file1, file2):
        print(process_line(line) + '\t' + process_line(line2))
    You can print output as you go, but this assumes that file 1 and file 2 have the same number of lines. All considered, I think this is something python can do efficiently.
    Last edited by whiteflags; 05-06-2013 at 04:17 PM. Reason: file format is way simpler than i thought

  9. #9
    Registered User
    Join Date
    Aug 2012
    Posts
    78
    Quote Originally Posted by whiteflags View Post
    You can print output as you go, but this assumes that file 1 and file 2 have the same number of lines. All considered, I think this is something python can do efficiently.
    Hi whiteflags! Thanks for the advice! Since I've started with C/C++ now, I would like to continue down this line. Might use python for some similar processing that comes up.

  10. #10
    Registered User
    Join Date
    Aug 2012
    Posts
    78
    Quote Originally Posted by laserlight View Post
    Do not mix C-style I/O with C++ I/O streams unless you have a good reason to do so. This:
    Code:
    FILE* fout = fopen(outfilename.c_str(), "w");
    // ...
    fprintf(fout, "%d\t", row);
    should be:
    Code:
    std::ofstream fout(outfilename.c_str());
    // ...
    fout << row << '\t';
    Thanks! I've made the corresponding changes. I think I read somewhere on Stack Overflow that the C variant is much faster. But if it causes compatibility issues, I'd be happy to omit it.
    Last edited by in_ship; 05-07-2013 at 12:00 AM.

  11. #11
    C++まいる!Cをこわせ!
    Join Date
    Oct 2007
    Location
    Inside my computer
    Posts
    24,654
    Do not assume C is faster in any way or form until you can prove it. Keep yourself in the domain of C++. Worry about possible optimizations only if you need them.

  12. #12
    Registered User
    Join Date
    Aug 2012
    Posts
    78
    Quote Originally Posted by Elysia View Post
    One way I can think of is to use std::getline to read a full line. Then you can construct a string stream from the string you read, and construct an istream_iterator from that. You can then use std::copy to copy the items out of the line into a vector, passing a back-insert iterator (std::back_inserter makes one) as the third parameter, the output iterator of std::copy.
    Then write everything back into an output file from the strings in your vector, putting whitespace and newlines as you desire.
    Alright - I found a way to integrate stringstream, istream_iterator and copy. Here's the current version of the code:
    Code:
    #include<iostream>
    #include<sstream>
    #include<fstream>
    #include<iterator>
    using namespace std;
    
    template <class T>
    
    class csv_istream_iterator: public iterator<input_iterator_tag, T>
    {
    istream * _input;
     char _delim;
    string _value;
    public:
    
     csv_istream_iterator( char delim = ',' ): _input( 0 ), _delim( delim ) {}
    
     csv_istream_iterator( istream & in, char delim = ',' ): _input( &in ), _delim( delim ) { ++*this; }
    
    const T operator *() const {
            istringstream ss( _value );
           T value;
           ss >> value;
           return value;
        }
     istream & operator ++() {
    
           if( !( getline( *_input, _value, _delim ) ) )
            {
                _input = 0;
           }
           return *_input;
     }
    bool operator !=( const csv_istream_iterator & rhs ) const {
       return _input != rhs._input;
        }
    };
    template <>
    const string csv_istream_iterator<string>::operator *() const {
    
     return _value;
    }
    int main( int argc, char * args[] )
    {
          ifstream ft( "input1.csv" );                                // declare ft stream
    
    
     
     
    // 1) Get number of data points
    
    string datapoints, dummy;
     getline(ft, datapoints, ',');               // read very first value: # data points
    
     cout << "datapoints: " << string(datapoints) << endl;// display number of datapoints
    
    double count = atof(datapoints.c_str());
    
     getline(ft, dummy, '\n');                       // read and drop second value
    
     
     
    // 2) Start writing into output file
    
     string outfilename = "out.txt";
     
     
    ofstream fout(outfilename.c_str());
    for (int row = 1; row <= count; row++)
     {
    // enumerator
    
     fout << row << '\t';
    
       if( ft )
        {
              copy( csv_istream_iterator<string>( ft ),
                              csv_istream_iterator<string>(),
                              ostream_iterator<string>( cout, "," ) );
        }
    }
     
    fout.close();
    return 0;
     
    }
     
    The output is as follows:

    1 1:4,2:5,3:7,4:4,5:7
    ,
    2 1:4,2:3,3:6,4:4,5:6
    ,
    3 1:4,2:4,3:7,4:4,5:7
    ,
    4 1:2,3:5,4:7,5:4,8:3
    ,
    5 2:3,4:4,5:7,,
    ,
    6 1:4,3:7,4:6,,
    ,
    7 ,,,,
    ,
    8 2:4,8:7,,,
    ,
    9 3:1,4:4,6:7,,
    ,
    10 1:5,2:8,,,
    ,
    11 1:1,2:5,,,
    ,
    12 3:4,4:5,6:8,,
    However, there are a few things which are disturbing and must be mended: 1) All these extra commas. If there's no additional character, the commas must go away. 2) I have not made use of back_inserter yet. Where does that come in?
    Last edited by in_ship; 05-07-2013 at 01:17 AM.

  13. #13
    Registered User
    Join Date
    Aug 2005
    Location
    Austria
    Posts
    1,990
    First of all, your code is not readable; learn to indent it properly.
    Second, it doesn't compile because of a missing header.
    Third, the output is different from what you're showing.
    Kurt

  14. #14
    Registered User
    Join Date
    Aug 2012
    Posts
    78
    Quote Originally Posted by ZuK View Post
    First of all, your code is not readable; learn to indent it properly.
    Second, it doesn't compile because of a missing header.
    Third, the output is different from what you're showing.
    Kurt
    I've uploaded the .cpp, the first .csv file and the output .txt file. You may test it.

  15. #15
    C++まいる!Cをこわせ!
    Join Date
    Oct 2007
    Location
    Inside my computer
    Posts
    24,654
    You didn't (and don't) have to write your own iterator. You can use ones already existing in the standard library. The one you have written just mimics istream_iterator anyway.
    You don't need to manually close files. They close automatically when they go out of scope. Learn to rely on this behaviour. It is a cornerstone in C++ called RAII.
    std::back_inserter creates a std::back_insert_iterator, which you can use to push items onto the back of a container. For example:

    Code:
    	std::ifstream in("myfile.txt");
    	std::vector<std::string> Lines;
    	std::copy(std::istream_iterator<std::string>(in), std::istream_iterator<std::string>(), std::back_inserter(Lines));
    	std::copy(Lines.begin(), Lines.end(), std::ostream_iterator<std::string>(std::cout, "\n"));
    An empty vector contains no elements, so you can't just write elements to it through an ordinary iterator. Instead you can use a back inserter, which calls container.push_back(elem) for every element written to it.
