Thread: Is serialization needed, if portability isn't important?

  1. #1
    [](){}(); manasij7479's Avatar
    Join Date
    Feb 2011
    Location
    *nullptr
    Posts
    2,657

    Is serialization needed, if portability isn't important?

    Consider I have a class template taking some arbitrary type as its argument and doing binary IO using memory mapped files.
    I don't do any serialization and just dump objects to be copied to a memory location.
    I can see that a LOT of time can be potentially saved, if I get a (typecasted) 'raw' pointer from the mapped location to represent the original object.
    What kind of issues might I run into with this approach?
    Do the problems, if any, exist if I use only POD types ?
    If I use the same compiler and runtime (possibly in different versions), are padding or C++-specific details going to be a problem?

    [PS: Is there something like a 'soft link' so that the post appears on C++ and Linux Programming both ?]

  2. #2
    Master Apprentice phantomotap's Avatar
    Join Date
    Jan 2008
    Posts
    5,108
    The answers to your questions unfortunately just depend on what you are trying to do.

    Even for trivial types you need to be aware of ABI issues. For example, among compilers that let `long double' use the 80-bit floating point type available on the x86 family, some always pad the variable to 96 or 128 bits, others pad only when it sits between members that would otherwise leave it misaligned, and still others never pad the 80 bits, leaving it misaligned.

    The point is, you may or may not have issues even for simple data types if you don't pay attention to the way those simple types may change under different compiler options.
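
    To make the `long double' point concrete, here is a minimal sketch of a layout guard. The names `Sample`, `LayoutInfo`, and `long_double_layout` are hypothetical and exist only for illustration; the idea is that both sides of a raw dump can record or assert the size, alignment, and member offset they were built with.

```cpp
#include <cstddef>

// Hypothetical record type used only for illustration.
struct Sample {
    char        tag;
    long double value;
};

// Layout facts that the writer and the reader of a raw dump must agree on.
struct LayoutInfo {
    std::size_t size;       // sizeof(long double), including any ABI padding
    std::size_t alignment;  // alignof(long double)
    std::size_t offset;     // where `value` lands inside Sample
};

constexpr LayoutInfo long_double_layout() {
    return { sizeof(long double), alignof(long double), offsetof(Sample, value) };
}

// A guard of this kind turns a silent layout mismatch into a build error.
// The checks are deliberately loose so they hold on any conforming compiler;
// a real project would pin the exact numbers it expects for its target.
static_assert(long_double_layout().offset % long_double_layout().alignment == 0,
              "value must be aligned inside Sample");
static_assert(sizeof(Sample) >= sizeof(char) + sizeof(long double),
              "Sample cannot be smaller than its members");
```

    Storing the three numbers from `long_double_layout()` in a file header, and comparing them when the file is opened, would catch a build made with different padding options before any data is misread.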

    Because you are talking about templates I'll assume you are also implying classes that have methods in which case other ABI issues will eat you alive if you aren't aware of them.

    If the compilers building the final binaries use exactly the same ABI:

    If this is just for communication, you need to remove any trivial allocations (from new and friends) in the code for the template class so that you can provide an allocator that works with the storage space associated with the memory mapped file. If the underlying types allocate memory, you'll have to provide an interface to do anonymous clones that grab storage from the memory mapped file. If the underlying types are simple, you'll be fine. You have to do this because memory grabbed from normal allocators is not going to be available to other binaries under almost any operating system.
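
    As a rough illustration of "an allocator that works with the storage space", the sketch below shows a minimal bump allocator over a caller-supplied byte range. `Region`, `RegionAllocator`, and `in_region` are hypothetical names; in real use `base` would be the address returned by the mapping call, and it is assumed here to be aligned for any type.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>
#include <vector>

// Hypothetical fixed-size region; `base` stands in for a mapped address.
struct Region {
    unsigned char* base;
    std::size_t    size;
    std::size_t    used;
};

// Minimal bump allocator: memory is carved from the region and never reused.
template <typename T>
struct RegionAllocator {
    using value_type = T;

    Region* region;

    explicit RegionAllocator(Region* r) noexcept : region(r) {}
    template <typename U>
    RegionAllocator(const RegionAllocator<U>& other) noexcept : region(other.region) {}

    T* allocate(std::size_t n) {
        assert(region != nullptr);
        // Round the current offset up to the alignment T requires.
        const std::size_t start = (region->used + alignof(T) - 1) / alignof(T) * alignof(T);
        if (start > region->size || n > (region->size - start) / sizeof(T))
            throw std::bad_alloc();
        region->used = start + n * sizeof(T);
        return reinterpret_cast<T*>(region->base + start);
    }

    void deallocate(T*, std::size_t) noexcept {}  // a bump allocator never frees
};

template <typename T, typename U>
bool operator==(const RegionAllocator<T>& a, const RegionAllocator<U>& b) noexcept {
    return a.region == b.region;
}
template <typename T, typename U>
bool operator!=(const RegionAllocator<T>& a, const RegionAllocator<U>& b) noexcept {
    return !(a == b);
}

// True when p points inside the region's storage.
inline bool in_region(const Region& r, const void* p) {
    const auto addr = reinterpret_cast<std::uintptr_t>(p);
    const auto base = reinterpret_cast<std::uintptr_t>(r.base);
    return addr >= base && addr < base + r.size;
}
```

    A `std::vector<int, RegionAllocator<int>>` built on this places its elements inside the region, but the vector object itself still holds absolute pointers, so another process can only use it if the region is mapped at the same address. Libraries such as Boost.Interprocess work around that with offset-based pointers.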

    If the compilers building the final binaries do not use exactly the same ABI:

    If this is just for communication, you'll need to come up with a way to completely separate storage concerns from mechanism concerns like the bridge pattern and then the above applies to the data only portion.

    Otherwise:

    If this is just for a private allocator range for performance purposes, you don't need to do anything fancy. If you've designed your allocator properly, the template class will get the performance benefit when the underlying type uses the same allocator but will still degrade gracefully when the underlying type uses other allocators.

    If this is for storage using a memory mapped file for performance reasons, you'll need to either be extremely careful with allocators, making sure to always use the memory mapped file aware allocator and primitives that know how to serialize themselves, or do some serialization of pointers and complex data structures whether you want to or not.

    I don't think you need this thread linked; memory mapped files are widely implemented.

    Soma
    Last edited by phantomotap; 04-12-2012 at 09:30 AM.

  3. #3
    [](){}(); manasij7479's Avatar
    Join Date
    Feb 2011
    Location
    *nullptr
    Posts
    2,657
    Thanks. It'll take some time for me to digest that answer...I'll post back asking for clarification later, if necessary.

  4. #4
    [](){}(); manasij7479's Avatar
    Join Date
    Feb 2011
    Location
    *nullptr
    Posts
    2,657
    Originally posted by phantomotap:
    The answers to your questions unfortunately just depend on what you are trying to do.
    I'll use it to share resources and do some 'basic' IPC.

    I decided to not make things complicated, as my requirements are very simple.
    So, to get rid of the allocator issues, I changed the design a little: the class is no longer a template but provides a template method for data access. The precondition is that the default constructor of the type passed as the template argument must create an object of the maximum size that any object of that type will ever need.
    That does not prevent me from using ..say.. the standard containers... only I can't increase their size.
    Writing a general purpose allocator for memory mapped files would take a lot of work to get it right (without even thinking about performance). (I'll do it someday, after going through a lot of theory on it)
    I can't see a way out for pointers (I may consider not allowing them, and insisting on fixed size arrays) other than serialization to prevent shallow copy.

    I don't understand much about the ABI issues... other than that they potentially exist (..what I meant by "C++ specific stuff" in the OP) .
    All compilers for my OS, afaik, follow something called the "Generic C++ ABI". Does that give me a safeguard?
    Last edited by manasij7479; 04-12-2012 at 03:13 PM.

  5. #5
    Master Apprentice phantomotap's Avatar
    Join Date
    Jan 2008
    Posts
    5,108
    That does not prevent me from using ..say.. the standard containers... only I can't increase their size.
    It can. I suggest you try a few examples and see for yourself.

    All compilers for my OS, afaik, follows something called the "Generic C++ ABI", does that give me a safeguard?
    It does; knowing from the outset that you'll probably require the use of the same compiler and libraries also helps.

    However, that's why I brought up the size of `long double'. "GCC" can use multiple different modes of alignment and padding for `long double' depending on the command line options used when compiling. This violates that "ABI" you are talking about, and the manual clearly states the issue, but "GCC" doesn't usually complain about options that violate that standard.

    My point wasn't to scare you away but to warn you to be aware of the issues, so that you can make an effort to avoid them from the outset by noting any options you need to use that change the "ABI".

    Soma

  6. #6
    [](){}(); manasij7479's Avatar
    Join Date
    Feb 2011
    Location
    *nullptr
    Posts
    2,657
    Originally posted by phantomotap:
    It can. I suggest you try a few examples and see for yourself.
    It can, if I get careless and throw a std::string at it, since its default constructor makes an empty string.
    I just have to insert the binary representation of the filled object directly into the file to use std::string, std::vector, std::map, etc.
    However, this, with a std::array, works as expected:
    Code:
        mm::MFile mf("a.dat");
        // 0 is the offset in bytes; replacing it with another number works equally well.
        auto& myarray = mf.get_obj<std::array<int, 5>>(0);

        for (int i = 0; i < 5; i++) myarray[i] = i;

        for (auto x : mf.get_obj<std::array<int, 5>>(0))
            std::cout << x << std::endl;
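
    The thread never shows how `get_obj` is implemented, so the following is only a guess at its shape, not the actual `mm::MFile` code. `MappedView` and `roundtrip_last` are hypothetical names, and the byte buffer stands in for whatever address the mapping call would return. The sketch adds the checks that a raw cast depends on: the type is trivially copyable, the object fits, and the offset is suitably aligned.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <type_traits>

// Hypothetical stand-in for the mapped file: a view over bytes already in memory.
class MappedView {
public:
    MappedView(unsigned char* data, std::size_t size) : data_(data), size_(size) {
        assert(data != nullptr);
    }

    // Reinterpret the bytes at `offset` as a T, as the thread's get_obj appears to do.
    template <typename T>
    T& get_obj(std::size_t offset) {
        // Rejects std::string, std::vector and friends at compile time.
        // Note: it cannot detect a plain struct that contains a raw pointer member.
        static_assert(std::is_trivially_copyable<T>::value,
                      "only trivially copyable types can be viewed in place");
        if (offset > size_ || sizeof(T) > size_ - offset)
            throw std::out_of_range("object does not fit in the mapping");
        unsigned char* p = data_ + offset;
        if (reinterpret_cast<std::uintptr_t>(p) % alignof(T) != 0)
            throw std::invalid_argument("offset is misaligned for T");
        return *reinterpret_cast<T*>(p);
    }

private:
    unsigned char* data_;
    std::size_t    size_;
};

// Mirrors the example above: write through one reference, read through another.
inline int roundtrip_last(MappedView& view) {
    auto& arr = view.get_obj<std::array<int, 5>>(0);
    for (int i = 0; i < 5; ++i) arr[i] = i;
    return view.get_obj<std::array<int, 5>>(0)[4];
}
```

    With the alignment check in place, an arbitrary byte offset is no longer accepted for every type, which is stricter than the "any offset works" behaviour described in the post.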
    Last edited by manasij7479; 04-12-2012 at 03:45 PM.

  7. #7
    Cat without Hat CornedBee's Avatar
    Join Date
    Apr 2003
    Posts
    8,895
    Basically, raw memory transfer never works for anything with embedded pointers. You miss the referenced data, and even if you don't, you have to allocate fresh memory and adjust addresses on deserialization, which defeats the point of raw memory transfer.
    std::string, std::vector, and all other containers except std::array have internal pointers.
    All the buzzt!
    CornedBee

    "There is not now, nor has there ever been, nor will there ever be, any programming language in which it is the least bit difficult to write bad code."
    - Flon's Law

  8. #8
    [](){}(); manasij7479's Avatar
    Join Date
    Feb 2011
    Location
    *nullptr
    Posts
    2,657
    Originally posted by CornedBee:
    Basically, raw memory transfer never works for anything with embedded pointers. You miss the referenced data, and even if you don't, you have to allocate fresh memory and adjust addresses on deserialization, which defeats the point of raw memory transfer.
    Then I'll have to make do with arrays and 'less dynamic' classes of my own that expose a good interface to the standard algorithms, which will serve most purposes.
    If I get around to writing an allocator, it'll probably be as slow as, if not slower than, serializing and deserializing if something like linked lists are put there. std::vectors will waste too much space on disk if I want to avoid relocating often.

    std::string, std::vector, and all other containers except std::array have internal pointers.
    What for, though ?
    If I were to write anything but the linked/tree structures, I wouldn't use pointers anywhere.

  9. #9
    Cat without Hat CornedBee's Avatar
    Join Date
    Apr 2003
    Posts
    8,895
    std::vector is in essence (leaving out allocators and such stuff):
    Code:
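    #include <cstddef>

    template <typename E>
    class vector {
      E *m_begin;       // start of the allocated block
      E *m_end;         // one past the last constructed element
      E *m_memory_end;  // one past the end of the allocated block

    public:
      using size_type = std::size_t;

      size_type size() const { return static_cast<size_type>(m_end - m_begin); }
      size_type capacity() const { return static_cast<size_type>(m_memory_end - m_begin); }
      // etc.
    };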
    How else would a dynamically growing array work except by having a pointer to a dynamically allocated memory block?

    std::string is essentially the same as std::vector, though possibly with some additional tricks (small inline buffer for short strings, or CoW functionality).

    std::deque is a pretty complicated structure; typically it consists of a dynamically allocated array of pointers, each of which points to a fixed-size block of elements (also dynamically allocated, of course).

    std::list is of course a doubly-linked list and has lots of individual nodes. std::forward_list is a singly-linked list. std::map and std::set are trees that also have lots of nodes.

    The std::unordered_* containers are basically forced into being chaining hash tables, so they consist of a dynamically allocated array of pointers (the bucket array) to what are essentially linked-list nodes containing the values.

    The rule of thumb is: if its size isn't fixed at compile time, it has dynamic allocation and thus contains pointers. Only std::array is fixed at compile time.
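
    The rule of thumb can be checked directly. The sketch below is an illustration with hypothetical helper names (`elements_live_outside`, `elements_live_inside`): it asks the compiler whether each type is trivially copyable and tests at run time whether the element storage sits inside the container object.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <type_traits>
#include <vector>

// std::array keeps its elements inline, so copying its bytes copies the data.
static_assert(std::is_trivially_copyable<std::array<int, 5>>::value,
              "std::array<int, 5> can be dumped as raw bytes");
// std::vector is only a handle; copying its bytes copies pointers, not elements.
static_assert(!std::is_trivially_copyable<std::vector<int>>::value,
              "std::vector<int> cannot be dumped as raw bytes");

// True when the vector's elements are stored outside the vector object itself.
inline bool elements_live_outside(const std::vector<int>& v) {
    assert(!v.empty());
    const auto obj_begin = reinterpret_cast<std::uintptr_t>(&v);
    const auto obj_end   = obj_begin + sizeof(v);
    const auto data      = reinterpret_cast<std::uintptr_t>(v.data());
    return data < obj_begin || data >= obj_end;
}

// True when the array's elements are stored inside the array object itself.
inline bool elements_live_inside(const std::array<int, 5>& a) {
    const auto obj_begin = reinterpret_cast<std::uintptr_t>(&a);
    const auto obj_end   = obj_begin + sizeof(a);
    const auto data      = reinterpret_cast<std::uintptr_t>(a.data());
    return data >= obj_begin && data + 5 * sizeof(int) <= obj_end;
}
```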

  10. #10
    spurious conceit MK27's Avatar
    Join Date
    Jul 2008
    Location
    segmentation fault
    Posts
    8,300
    I can see that a LOT of time can be potentially saved
    Nah. A simple serialization scheme for data in vectors, etc, is only going to be a couple steps from raw. You don't have to turn it into XML or something. Methinks trying to get around this falls into the category of "pointless and awkward optimization".

    I'm also having a hard time seeing how a "raw dump" of vector data would work. How does the other party know how long the vector is, etc? By the time you resolve those kinds of issues, you're pretty much at a protocol.

    I presume there is some generic method for easy serialization of standard types in C++ standard containers. If there isn't, someone should write one.
    Last edited by MK27; 04-13-2012 at 05:56 AM.

  11. #11
    Registered User
    Join Date
    Oct 2006
    Posts
    3,445
    I suggest that you look at Google Protocol Buffers. They use a very fast, portable binary format, and the documentation and support are excellent.

  12. #12
    [](){}(); manasij7479's Avatar
    Join Date
    Feb 2011
    Location
    *nullptr
    Posts
    2,657
    Originally posted by MK27:
    I'm also having a hard time seeing how a "raw dump" of vector data would work.
    Nah.. it wouldn't. I (wrongly) had the idea that vectors and strings store arrays, not pointers.
    I presume there is some generic method for easy serialization of standard types in C++ standard containers. If there isn't, someone should write one.
    Boost has one... I'm yet to learn it though.

    Methinks trying to get around this falls into the category of "pointless and awkward optimization".
    Consider a use case of what I'm doing.
    Two processes are using the same data, say a 'very' big table. (I'll use shm_open() instead of files for smaller cases, but the procedure remains the same.)
    One can edit it and the other displays it, or does something fancy with it.
    With my approach, the only time lost is when the editor locks the file for writing, which will be quite frequent.
    But if I have to do 'any' kind of serialization, this operation takes about 2 cycles of serialization and back.
    Do you have a better way ?
    Last edited by manasij7479; 04-13-2012 at 08:08 AM.

  13. #13
    [](){}(); manasij7479's Avatar
    Join Date
    Feb 2011
    Location
    *nullptr
    Posts
    2,657
    Originally posted by Elkvis:
    I suggest that you look at google protocol buffers. they use a very fast, portable binary format, and the documentation and support are excellent.
    Thanks, I'll look into it.

  14. #14
    Master Apprentice phantomotap's Avatar
    Join Date
    Jan 2008
    Posts
    5,108
    Do you have a better way ?
    MK27 was almost certainly referring to the fact that there are infinitely many ways to serialize data.

    You could go for full serialization that breaks components down into a form suitable for long term archival and transmission.

    You could also reason that as this code is on the same box you only need to serialize for raw binary transmission.

    So for example, you don't necessarily serialize a `deque' as the lists of vectors it probably is underneath with the need to account for padding, endianness, alignment, or arbitrary size. (If you were serializing for long term storage these are things that need to be considered.) You can just serialize the data as a specific fixed size array with a length prefix.
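
    A minimal sketch of that idea follows, assuming a deque of 32-bit integers and native byte order (fine on the same box, not for long term storage). `pack` and `unpack` are hypothetical names. The deque's internal block structure is ignored; only a count followed by the elements is written.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <deque>
#include <vector>

// Flatten a deque into [count][elem0][elem1]... using native byte order.
inline std::vector<unsigned char> pack(const std::deque<std::int32_t>& src) {
    const std::uint64_t count = src.size();
    std::vector<unsigned char> out(sizeof count + src.size() * sizeof(std::int32_t));
    std::memcpy(out.data(), &count, sizeof count);
    unsigned char* p = out.data() + sizeof count;
    for (std::int32_t v : src) {  // deque storage is not contiguous: copy element-wise
        std::memcpy(p, &v, sizeof v);
        p += sizeof v;
    }
    return out;
}

// Rebuild the deque from the length-prefixed bytes produced by pack().
inline std::deque<std::int32_t> unpack(const unsigned char* bytes, std::size_t size) {
    std::uint64_t count = 0;
    assert(size >= sizeof count);
    std::memcpy(&count, bytes, sizeof count);
    assert((size - sizeof count) / sizeof(std::int32_t) >= count);
    std::deque<std::int32_t> out;
    const unsigned char* p = bytes + sizeof count;
    for (std::uint64_t i = 0; i < count; ++i) {
        std::int32_t v;
        std::memcpy(&v, p, sizeof v);
        out.push_back(v);
        p += sizeof v;
    }
    return out;
}
```

    The bytes returned by `pack` could be copied straight into the mapped region, and the reader would call `unpack` on the same range.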

    Of course, it will be slower than some alternatives to interprocess communication, but no method is free and easy.

    You may want to look into "STXXL" or the numerous `std::allocator' compatible components for memory mapped files open source programmers have made available.

    Soma

  15. #15
    [](){}(); manasij7479's Avatar
    Join Date
    Feb 2011
    Location
    *nullptr
    Posts
    2,657
    Originally posted by phantomotap:
    You can just serialize the data as a specific fixed size array with a length prefix.
    That is my point, now.
    Often, I would not need much more than a plain fixed array of n dimensions to cover everything.
    Higher level data structures can be built up in memory using particular indices of the array as pointers.
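
    One way to read "indices of the array as pointers" is a linked list whose links are array indices. The sketch below is an illustration under that reading; `Node`, `IndexList`, `kNil`, and `raw_copy` are hypothetical names. Because no member is a real pointer, the whole structure survives a byte-for-byte copy, which is what a raw dump to a mapped file amounts to.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>

constexpr std::int32_t kNil = -1;  // plays the role of nullptr

struct Node {
    std::int32_t value;
    std::int32_t next;  // index of the next node, or kNil
};

// Hypothetical fixed-capacity list: everything lives in one flat array.
template <std::size_t N>
struct IndexList {
    std::array<Node, N> nodes;
    std::int32_t head;
    std::int32_t used;

    void init() { head = kNil; used = 0; }

    // Returns false when the fixed capacity is exhausted.
    bool push_front(std::int32_t v) {
        if (static_cast<std::size_t>(used) == N) return false;
        nodes[static_cast<std::size_t>(used)] = Node{v, head};
        head = used++;
        return true;
    }

    std::int32_t sum() const {
        std::int32_t total = 0;
        for (std::int32_t i = head; i != kNil; i = nodes[static_cast<std::size_t>(i)].next)
            total += nodes[static_cast<std::size_t>(i)].value;
        return total;
    }
};

static_assert(std::is_trivially_copyable<IndexList<8>>::value,
              "an index-linked list can be dumped as raw bytes");

// Simulates the raw dump: copy the bytes elsewhere and keep using the structure.
template <std::size_t N>
IndexList<N> raw_copy(const IndexList<N>& src) {
    IndexList<N> dst;
    std::memcpy(&dst, &src, sizeof src);
    return dst;
}
```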

    Last Post: 11-10-2002, 10:03 AM