Thread: Read binary file into std::vector

  1. #1
    Registered User
    Join Date
    Mar 2020
    Posts
    11

    Read binary file into std::vector

    Hi,
    I am having a hard time finding the cleanest solution for the following problem.
    I want to read a binary file into an std::vector.
    Checking online I found the following solution:
    Code:
    std::ifstream infile(filename, std::ifstream::binary);

    infile.seekg(0, infile.end);
    const auto N = static_cast<std::size_t>(infile.tellg());   // N is the file size in bytes
    infile.seekg(0, infile.beg);

    // Allocate N / sizeof(double) doubles, then read the bytes straight into the vector.
    std::vector<double> buf(N / sizeof(double));
    infile.read(reinterpret_cast<char*>(buf.data()), buf.size() * sizeof(double));

    This works fine if my data is an array of doubles, but it can cause problems when used with user-defined structs. In particular, problems may arise from a mismatch between the alignment of the data in the binary file and in the vector.

    In particular my binary file contains an array of the following type:

    Code:
    struct NormalizedPair
    {
        uint32_t ID1;
        uint32_t ID2;
        double distance;
    
        NormalizedPair()=default;
        NormalizedPair(uint32_t id1, uint32_t id2, double d):  ID1(id1), ID2(id2), distance(d) {}
    };
    Is there a universal approach to solve this?
    Should I use a dedicated library? I am handling large binary files (tens of GB).

    I have looked everywhere but cannot find a 100% correct solution for what seems like a trivial problem: loading a file into a container.

    Thanks in advance!

  2. #2
    C++ Witch laserlight
    Join Date
    Oct 2003
    Location
    Singapore
    Posts
    28,413
    Is this binary file in a pre-defined format, or do you have control over the whole process, i.e., you just want a portable way to write to a file in a binary format that you can read from?

    If it is the latter, then I suggest a radically different approach: use a SQLite database. This way, you immediately eliminate the issue of incompatible binary formats because of things like doubles and alignment (or rather, you defer the problem to SQLite, but SQLite is sufficiently cross-platform, thoroughly tested, and widely used that this should be a non-issue), while opening other options for viewing your data (e.g., using a generic browser for SQLite databases). As a plus, you get the usual goodies for such relational database engines, e.g., you can easily define indices (probably important since you're dealing with "tens of GB"), while retaining the simplicity of a single database file (the limit for SQLite databases is 140 terabytes per database file, so even hundreds of GB would be no issue as long as you have the storage that can handle all that in a single file).

    If it is the former though, then my question would be: why aren't you creating a std::vector<NormalizedPair> instead of a std::vector<double>? The binary format would still not be portable because double representations can differ, but there would be no "mismatch in the alignment of the data in the binary file and in the vector" for the same compiled program.
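
    As a rough sketch of that suggestion, the read from the first post could target a std::vector<NormalizedPair> directly. The helper name load_pairs and the 16-byte record size are assumptions made for illustration, not something the thread confirms about the actual file; the static_asserts only make the assumed layout explicit for the reading program.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <stdexcept>
#include <string>
#include <type_traits>
#include <vector>

// The struct from the first post (constructors omitted for brevity).
struct NormalizedPair
{
    std::uint32_t ID1;
    std::uint32_t ID2;
    double distance;
};

// Raw byte copies are only well-defined for trivially copyable types.
static_assert(std::is_trivially_copyable<NormalizedPair>::value,
              "NormalizedPair must be trivially copyable");
// Assumption: the file uses the common 16-byte layout (4 + 4 + 8 bytes, no padding).
static_assert(sizeof(NormalizedPair) == 16, "unexpected size or padding");

// Hypothetical helper: loads a whole file as a raw array of NormalizedPair.
std::vector<NormalizedPair> load_pairs(const std::string& filename)
{
    // std::ios::ate opens the stream positioned at the end, so tellg() is the file size.
    std::ifstream infile(filename, std::ios::binary | std::ios::ate);
    if (!infile)
        throw std::runtime_error("cannot open " + filename);

    const std::streamoff end = infile.tellg();
    if (end < 0)
        throw std::runtime_error("cannot determine size of " + filename);
    const auto bytes = static_cast<std::size_t>(end);
    if (bytes % sizeof(NormalizedPair) != 0)
        throw std::runtime_error("file size is not a multiple of the record size");

    std::vector<NormalizedPair> pairs(bytes / sizeof(NormalizedPair));
    assert(pairs.size() * sizeof(NormalizedPair) == bytes);

    infile.seekg(0, std::ios::beg);
    infile.read(reinterpret_cast<char*>(pairs.data()), static_cast<std::streamsize>(bytes));
    if (!infile)
        throw std::runtime_error("short read from " + filename);
    return pairs;
}
```

    Note that this loads everything at once, so for files of tens of GB it needs that much memory; reading in fixed-size chunks with the same checks would be one way around that.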
    Last edited by laserlight; 05-27-2020 at 05:36 PM.
    Quote Originally Posted by Bjarne Stroustrup (2000-10-14)
    I get maybe two dozen requests for help with some sort of programming or design problem every day. Most have more sense than to send me hundreds of lines of code. If they do, I ask them to find the smallest example that exhibits the problem and send me that. Mostly, they then find the error themselves. "Finding the smallest program that demonstrates the error" is a powerful debugging tool.
    Look up a C++ Reference and learn How To Ask Questions The Smart Way

  3. #3
    Registered User
    Join Date
    Mar 2020
    Posts
    11
    The binary file is data generated by a colleague with a pre-defined format: raw array of NormalizedPair.
    It was not clear in my first post but I am actually using a std::vector<NormalizedPair>. The problem I was afraid of comes from the fact that the data was generated with an unknown version of the g++ compiler in a random OS and I am now reading the data with g++9 in Linux.
    I thought alignment could become an issue even though we are using the same struct.

    In the future I will have control over the generation of the data. For that case I will check SQLite. Is SQLite recommended over HDF5? The end result will always be the same: I want to consistently load the binary/SQLite/HDF5 data into an std::vector<NormalizedPair>. This is because I want to use a set of processing functions (user-defined, but also std algorithms) which take C++ containers as inputs. Actually I should read more about SQLite API, maybe I can use it directly without needing a C++ container.

  4. #4
    C++ Witch laserlight
    Join Date
    Oct 2003
    Location
    Singapore
    Posts
    28,413
    Quote Originally Posted by Fede
    The binary file is data generated by a colleague with a pre-defined format: raw array of NormalizedPair.
    It was not clear in my first post but I am actually using a std::vector<NormalizedPair>. The problem I was afraid of comes from the fact that the data was generated with an unknown version of the g++ compiler in a random OS and I am now reading the data with g++9 in Linux.
    I thought alignment could become an issue even though we are using the same struct.
    My concern would be the representation of the double: as far as I know, IEEE 754 is not guaranteed. With the unknowns that you have listed, it seems to me that your task is iffy: if you don't know the format of a bit pattern even if you know it represents a double, how are you going to reliably interpret it? It'll be like someone asking you what is 1111 in "binary", and when you answer 15 they go "aha! wrong! it is -1 loser".
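
    A small sketch of how the reading side could at least pin down its own conventions. This checks only the program doing the reading; it says nothing about the machine that wrote the file. The function names are made up for illustration, and the last two functions just replay the "1111" example under two interpretations.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <limits>

// Compile-time checks that this platform stores double as a 64-bit IEEE 754 value.
static_assert(std::numeric_limits<double>::is_iec559, "double is not IEEE 754");
static_assert(sizeof(double) == 8, "double is not 64 bits wide");

// Returns the raw bit pattern of a double as an unsigned 64-bit integer.
std::uint64_t bits_of(double value)
{
    std::uint64_t bits;
    std::memcpy(&bits, &value, sizeof bits);
    return bits;
}

// The low four bits read as an unsigned number: 1111 is 15.
int as_unsigned_4bit(unsigned nibble)
{
    return static_cast<int>(nibble & 0xFu);
}

// The same four bits read as a 4-bit two's complement number: 1111 is -1.
int as_twos_complement_4bit(unsigned nibble)
{
    const int v = static_cast<int>(nibble & 0xFu);
    return v >= 8 ? v - 16 : v;
}
```

    Dumping bits_of() for a few values whose true magnitude is known is a cheap way to compare what the file contains against what the reader expects.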

    On the other hand, since this is "pre-defined" even though the definitions are unknown to you (a colleague who has left, leaving behind source code and a huge store of useful binary data but with no details on how the program that generated the data was compiled?), it sounds like it isn't a moving target, so what you could do is just give it a try with some data, and if it consistently gives you results within expected parameters, great, then it probably does work.
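
    The "give it a try" step could be sketched as a plausibility count over a sample of records. What counts as "within expected parameters" is not stated in the thread, so the bounds lo and hi are placeholders the caller has to supply, and count_implausible is a hypothetical name.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// The struct from the first post (constructors omitted for brevity).
struct NormalizedPair
{
    std::uint32_t ID1;
    std::uint32_t ID2;
    double distance;
};

// Counts records whose distance is NaN, infinite, or outside [lo, hi].
// A misread layout or byte order tends to produce many such values.
std::size_t count_implausible(const std::vector<NormalizedPair>& pairs, double lo, double hi)
{
    std::size_t bad = 0;
    for (const NormalizedPair& p : pairs)
    {
        if (!std::isfinite(p.distance) || p.distance < lo || p.distance > hi)
            ++bad;
    }
    return bad;
}
```

    A count of zero on a decent sample does not prove the layout is right, but a large count is a strong hint that something is off.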

    Quote Originally Posted by Fede
    Is SQLite recommended over HDF5?
    I am not familiar with HDF5, but a quick search shows that it is a hierarchical data format, which of course is different in paradigm from SQLite, which is a relational database. Hence, which to choose depends on your requirements: is your data more akin to files grouped into folders? Then perhaps HDF5 is a good option. But if it is more in the form of disparate entities related to each other, where the entities can be naturally represented in a tabular form, then perhaps SQLite is the better option.

    Quote Originally Posted by Fede
    Actually I should read more about SQLite API, maybe I can use it directly without needing a C++ container.
    That's possible, although I think it may be more common when SQLite is used as an in-memory database.

  5. #5
    Registered User Sir Galahad
    Join Date
    Nov 2016
    Location
    The Round Table
    Posts
    277
    Quote Originally Posted by Fede
    The binary file is data generated by a colleague with a pre-defined format: raw array of NormalizedPair.
    It was not clear in my first post but I am actually using a std::vector<NormalizedPair>. The problem I was afraid of comes from the fact that the data was generated with an unknown version of the g++ compiler in a random OS and I am now reading the data with g++9 in Linux.
    I thought alignment could become an issue even though we are using the same struct.

    In the future I will have control over the generation of the data. For that case I will check SQLite. Is SQLite recommended over HDF5? The end result will always be the same: I want to consistently load the binary/SQLite/HDF5 data into an std::vector<NormalizedPair>. This is because I want to use a set of processing functions (user-defined, but also std algorithms) which take C++ containers as inputs. Actually I should read more about SQLite API, maybe I can use it directly without needing a C++ container.

    If you don't have control over the output file then your only real choice is to read it in binary and just do trial and error adjustments to account for alignment, data-widths, endian-ness, and padding.
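
    One way to make those trial-and-error adjustments explicit is to decode each record field by field from raw bytes, with the offsets, widths, and byte order written down as constants. The values below (16-byte records, little-endian, no padding between fields) are a starting guess for illustration only; the names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// The struct from the first post (constructors omitted for brevity).
struct NormalizedPair
{
    std::uint32_t ID1;
    std::uint32_t ID2;
    double distance;
};

// Assumed on-disk layout; these are the knobs to adjust while experimenting.
constexpr std::size_t kRecordSize     = 16;
constexpr std::size_t kOffsetID1      = 0;
constexpr std::size_t kOffsetID2      = 4;
constexpr std::size_t kOffsetDistance = 8;

// Assembles an unsigned little-endian integer of `width` bytes (width <= 8),
// independent of the byte order of the machine running this code.
std::uint64_t read_le(const unsigned char* p, std::size_t width)
{
    std::uint64_t v = 0;
    for (std::size_t i = 0; i < width; ++i)
        v |= static_cast<std::uint64_t>(p[i]) << (8 * i);
    return v;
}

// Decodes one record that starts at `rec` and spans kRecordSize bytes.
NormalizedPair decode_record(const unsigned char* rec)
{
    NormalizedPair p;
    p.ID1 = static_cast<std::uint32_t>(read_le(rec + kOffsetID1, 4));
    p.ID2 = static_cast<std::uint32_t>(read_le(rec + kOffsetID2, 4));
    // Assumes the file and this machine both use IEEE 754 binary64 for double.
    const std::uint64_t bits = read_le(rec + kOffsetDistance, 8);
    std::memcpy(&p.distance, &bits, sizeof p.distance);
    return p;
}
```

    If the data turns out to be big-endian, only read_le needs a counterpart that shifts the bytes in the opposite order; if there is padding, only the offset constants change.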

    Otherwise, I'd recommend using a serialization library. A good one will pack the data pretty efficiently.

    You could also go with plain text files. Not nearly as compact but much easier to code up. For something like that, fprintf() and fscanf() for example could be used.
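
    A minimal sketch of the plain-text route with fprintf() and fscanf(), one record per line. The function names and the space-separated format are assumptions; "%.17g" is used because 17 significant digits are enough to round-trip an IEEE 754 double.

```cpp
#include <cassert>
#include <cinttypes>
#include <cstdint>
#include <cstdio>
#include <vector>

// The struct from the first post (constructors omitted for brevity).
struct NormalizedPair
{
    std::uint32_t ID1;
    std::uint32_t ID2;
    double distance;
};

// Writes one record per line as "ID1 ID2 distance". Returns false on failure.
bool write_text(const char* path, const std::vector<NormalizedPair>& pairs)
{
    std::FILE* f = std::fopen(path, "w");
    if (!f)
        return false;
    bool ok = true;
    for (const NormalizedPair& p : pairs)
    {
        if (std::fprintf(f, "%" PRIu32 " %" PRIu32 " %.17g\n", p.ID1, p.ID2, p.distance) < 0)
            ok = false;
    }
    return std::fclose(f) == 0 && ok;
}

// Reads records until a line no longer matches the expected three fields.
std::vector<NormalizedPair> read_text(const char* path)
{
    std::vector<NormalizedPair> pairs;
    std::FILE* f = std::fopen(path, "r");
    if (!f)
        return pairs;
    NormalizedPair p;
    while (std::fscanf(f, "%" SCNu32 " %" SCNu32 " %lf", &p.ID1, &p.ID2, &p.distance) == 3)
        pairs.push_back(p);
    std::fclose(f);
    return pairs;
}
```

    The text file is several times larger than the binary one and slower to parse, which matters at tens of GB, but it sidesteps padding and byte-order questions entirely.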
