Serializing classes

**m37h0d** · 06-14-2011

i have two cents here:

if you're going to use text, use XML.

if you're going to use binary, consider picking up SQLite.

**phantomotap** · 06-14-2011

[Edit]
Okay, the post was edited after I responded, but this has apparently already been seen.

*shrug*

I don't know.

[/Edit]

But to turn that back into what it was, you can't just iterate and replace like you did before -- you are going to need a state machine parser, lookaheads, etc.

O_o

The only context in that serialization (a simple self-escaping character scheme) is the escape character and the very next character that follows the escape character. So, actually, you pretty much do iterate and replace exactly as you did before.

In the great before, you scan and dump until you find a character that needs to be escaped where you dump the escape character followed by the character that represents the escaped character. You essentially perform a single lookup.

In reading it back in, you scan and dump until you find the escape character where you look at the next character and dump the actual value of the representation. You essentially perform a reverse lookup.

A simple transition table, one entry per escaped character, gets you everything you need.
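To make the lookup idea concrete, here is a minimal sketch. It assumes '/' is the escape character, a newline is stored as "/n", and a literal '/' is stored as "//", which matches the "blah" example discussed in this thread. The names `EscapeEntry`, `kEscapes`, `escape`, and `unescape` are made up for illustration, and the error handling is deliberately lenient.

```cpp
#include <string>

// Hypothetical scheme for illustration: '/' is the escape character,
// a newline is stored as "/n" and a literal '/' as "//".
struct EscapeEntry { char raw; char code; };
static const EscapeEntry kEscapes[] = { {'\n', 'n'}, {'/', '/'} };
static const char kEscapeChar = '/';

// Writing: scan and copy; when a character is in the table, emit the
// escape character followed by its code (one lookup per character).
std::string escape(const std::string& in)
{
    std::string out;
    for (std::string::size_type i = 0; i < in.size(); ++i) {
        bool escaped = false;
        for (const EscapeEntry& e : kEscapes) {
            if (in[i] == e.raw) {
                out += kEscapeChar;
                out += e.code;
                escaped = true;
                break;
            }
        }
        if (!escaped)
            out += in[i];
    }
    return out;
}

// Reading: scan and copy; after an escape character, map the next
// character back to its raw value (the reverse lookup).
// An unknown code yields the code character itself, and a trailing
// escape character is copied as-is.
std::string unescape(const std::string& in)
{
    std::string out;
    for (std::string::size_type i = 0; i < in.size(); ++i) {
        if (in[i] != kEscapeChar || i + 1 == in.size()) {
            out += in[i];
            continue;
        }
        char code = in[++i];
        char raw = code;
        for (const EscapeEntry& e : kEscapes) {
            if (code == e.code) {
                raw = e.raw;
                break;
            }
        }
        out += raw;
    }
    return out;
}
```

Adding another escaped character to this sketch means adding one entry to `kEscapes`; neither loop changes.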

This isn't something as complex as XML.

Soma

**King Mir** · 06-14-2011

Originally Posted by MK27

Think about the parsing getting that out. Here's a string:

"blah/
hah-ha"

So before serialization we iterate thru and turn that into:

"blah///nhah-ha"

Easy. But to turn that back into what it was, you can't just iterate and replace like you did before -- you are going to need a lookahead/lookback deal. Not so complicated, but unless there's a reason to do so...this is not going to make the task easier.

Or maybe that's just a matter of style. I defer, point taken. The OP needs to decide whether human readability is important or not, because the latter is quicker and simpler.

It's quite easy: you copy characters one at a time unless you read the escape character. If you read an escape character, you switch on the next character to decide what to do. If there is no next character, and possibly if the next character is not a valid escape, you report an error.
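A rough sketch of that decoding loop, with the error cases included. The escape scheme ('/' as the escape character, "/n" for newline, "//" for a literal slash) and the name `unescape_checked` are assumptions for illustration, not something fixed by the thread.

```cpp
#include <string>

// Hypothetical scheme: '/' escapes, "/n" is a newline, "//" is a literal '/'.
// Returns false on a dangling escape or an unknown escape code.
bool unescape_checked(const std::string& in, std::string& out)
{
    out.clear();
    for (std::string::size_type i = 0; i < in.size(); ++i) {
        if (in[i] != '/') {          // ordinary character: copy it
            out += in[i];
            continue;
        }
        if (++i == in.size())        // escape with no next character
            return false;
        switch (in[i]) {             // decide based on the next character
        case 'n': out += '\n'; break;
        case '/': out += '/';  break;
        default:  return false;      // not a valid escape
        }
    }
    return true;
}
```

Note that the loop never looks further than the one character after the escape, and never looks back.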

**phantomotap** · 06-14-2011

As a matter of interest, my posts weren't intended to offer support for the comments made by Elysia.

I realize now that my original intent was completely lost when I responded to what MK27 said directly instead of to the total context of the thread; instead I offered only a bit of flaky context to support what I didn't even explain.

*shrug*

I guess I need more sleep.

My intent in dropping by was to say that serialization is a complex field and there is no easy answer unless one does use a library designed specifically for that purpose. The real world will pretty much guarantee that it will not be as simple as "just use $X" even then.

Saying simply "use operators >> and <<" is pretty much as foolish and harmful as the other serialization related advice I very vocally despise.

Soma

**MK27** · 06-14-2011

Originally Posted by King Mir

It's quite easy: you copy characters one at a time unless you read the escape character. If you read an escape character, you switch on the next character to decide what to do.

Yeah, that's what I meant by lookahead. I'll admit I've been a bit bombastic here, sorry. My point was that while using >> and plain text is possible, it is NOT the easy and efficient way, and it is only a good choice if you need human readability in the file*, which I have not seen the OP (Whyrusleeping) state as a goal.

* or have some bizarre aversion to low level I/O and binary files.

**King Mir** · 06-14-2011

Originally Posted by MK27

Yeah, that's what I meant by lookahead. I'll admit I've been a bit bombastic here, sorry. My point was that while using >> and plain text is possible, it is NOT the easy and efficient way, and it is only a good choice if you need human readability in the file*, which I have not seen the OP (Whyrusleeping) state as a goal.

* or have some bizarre aversion to low level I/O and binary files.

It's not lookahead. But yeah you don't want to use >> and << for strings, but instead copy character by character. Strings and character arrays would be a special case in that kind of setup.

As for whether you want human readability, the answer is always yes for unencrypted content. The question is whether that readability is important enough to trump other considerations. For a beginner, readability is particularly important, because it makes debugging that much easier.

**Whyrusleeping** · 06-14-2011

Well, with a little help from here and a lot of man page referencing i finally got around to what i wanted to do (sorry if i was ever unclear on what i was asking, ive never really worked with file i/o before unless you count pickling in python)

Code:

class datas
{
public:
  char filename[64];
  void s(const char *fname)
  {
    int l = strlen(fname);
    for(int i = 0; i < l; i++)
      {
	filename[i] = fname[i];
      }
    filename[l] = 0;
  }

  void add(const char *word)
  {
    ofstream file (filename, ios::out | ios::app | ios::binary);
    file.seekp(0, ios::end);
    int m = strlen(word) + 1;
    char num[16];
    sprintf(num, "%d", m);
    file.write (num, 4);
    file.write (word, m);
    file.close();
  }

  int search(const char *word)
  {
    ifstream file (filename, ios::in | ios::binary);
    int s = 0;
    int a = 0;
    char num[16];
    char w[32];
    while(! file.eof())
      {
	file.read(num, 4);
	a = atoi(num);
	file.read (w, a);
	if(!strcmp(word, w))
	  {
	    file.close();
	    return 1;
	  }
      }
    file.close();
    return 0;
  }
	
};



int main()
{
  datas test;
  test.s("dt.bin");
  system("rm dt.bin");
  int run = 1;
  char inp[64];
  char srch[64];
  int a = 0;
  while (run == 1)
    {
      cin.getline(inp, 64);
      a = strlen(inp);
      cout << a;
      inp[a + 1] = 0;
      if(!strcmp(inp, "!exit"))
	{
	  run = 0;
	  cout << "exiting\n";
	    }
	else if(!strncmp(inp, "!search", 7))
	  {
	    a = strlen(inp) - 8;
	    for(int i = 0; i < a; i++)
	      {
		srch[i] = inp[8 + i];
	      }
	    srch[a] = 0;
	    if(test.search(srch) == 1)
	      {
		cout << "found in list!\n";
	      }
	  }
	else
	  {
	    test.add(inp);
	  }
    }
    return 0;
}

basically, any word you type will be added to the file and typing !search will search the file to see if it contains that word.
[edit]
this strays from what i was orignally intending to do, but works much better for what i wanted
[/edit]

**King Mir** · 06-14-2011

I was feeling generous today. Remove all red, add all blue. Also, read comments.

Code:

class datas//conventionally classes should start with a capital letter. 
{
public:
  char filename[64];//file names can be longer than 63 characters. Use an std::string.
  void s(const char *fname)
  {
    int l = strlen(fname);
    for(int i = 0; i < l; i++)//what if l>63? Never write a program that can crash.
      {
	filename[i] = fname[i];
      }
    filename[l] = 0;
  }

  void add(const char *word)
  {
    ofstream file (filename, ios::out | ios::app | ios::binary);
    file.seekp(0, ios::end);
    int m = strlen(word) + 1;//you should just store the length.
    char num[16];//only need 11
    sprintf(num, "%d", m);
    file.write (num, 4);//red: remove this line.
    /*Write the first four digits of the length? It won't crash, but it will cause an 
       error. As long as you're aware, it's not vital to fix this. 
       But consider:*/
    file << setw(10) << m << std::ends;//blue: add this line.
    /*better not to use ends, but this is what you intended. */
    file.write (word, m);
    /*Better to write a \n instead of \0, so a text editor can read it.*/
    file.close();
  }

  int search(const char *word)
  {
    ifstream file (filename, ios::in | ios::binary);
    int s = 0;
    int a = 0;
    char num[16]={'\0'};//ensures null termination.
    /*also, this should be 12 chars long, not 16*/
    char w[32];
    while(! file.eof())
      {
	file.read(num, 11);//was 4 (red); 11 is blue.
	a = atoi(num);
        /*error if num is not null terminated. No program should crash when fed 
           bad data*/
	file.read (w, a);
        /* reading can fail, check for this so you don't use bad data.*/
	if(!strcmp(word, w))//red: remove this line.
        /*what if w>31? what if no null is read?*/
        if(std::string(w,a-1) == word)//blue: add this line.
        /*strings are safer, because they store their own length, and don't
           overflow*/
	  {
	    file.close();
	    return 1;
	  }
      }
    file.close();
    return 0;
  }
	
};



int main()
{
  datas test;
  test.s("dt.bin");
  /*since a datas object must have a database file, this should be a constructor, not
     a method*/
  system("rm dt.bin");
  int run = 1;
  char inp[64];//just use an std::string
  char srch[64];
  int a = 0;
  while (run == 1)
    {
      cin.getline(inp, 64);
      a = strlen(inp);
      cout << a;
      inp[a + 1] = 0;//you don't want to do this.
      if(!strcmp(inp, "!exit"))
	{
	  run = 0;
	  cout << "exiting\n";
	    }
	else if(!strncmp(inp, "!search", 7))
	  {
	    a = strlen(inp) - 8;
	    for(int i = 0; i < a; i++)
	      {
		srch[i] = inp[8 + i];
	      }
	    srch[a] = 0;
	    if(test.search(srch) == 1)
	      {
		cout << "found in list!\n";
	      }
	  }
	else
	  {
	    test.add(inp);
	  }
    }
    return 0;
}

Also, everywhere, use better variable names.

**Whyrusleeping** · 06-14-2011

on the length of the input, i was meaning to cap it at 64, the database is meant to hold individual words, and i dont know any words that are 64 letters long...

**Elysia** · 06-15-2011

While I won't discourage you from what you're trying to do, this code is extremely dangerous. You need to patch it up a bit.

Code:

  void s(const char *fname)
  {
    int l = strlen(fname);
    for(int i = 0; i < l; i++)
      {
	filename[i] = fname[i];
      }
    filename[l] = 0;
  }

Why you feel you need to do this is beyond me. A simple solution would be:

Code:

  void s(const std::string& fname)
{
	filename = fname;
}

Btw, s is a very poor name for a member function.

Code:

    char num[16];
    sprintf(num, "%d", m);

Dangerous and a ticking time bomb. The size of an int is implementation defined, and thus so is the length of its printed form.
Furthermore, should you change %d to something else, or reduce the size of num, you can find yourself with buffer overruns.
A better approach might be:

Code:

std::string num = boost::lexical_cast<std::string>(m);

(Requires boost library.)
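If Boost is not available, the standard library can do the same conversion without a fixed-size buffer. This is only a sketch; the function names are invented for illustration, and the second variant assumes a C++11 compiler for `std::to_string`.

```cpp
#include <sstream>
#include <string>

// Convert an int to its decimal text using a string stream.
// The stream grows as needed, so there is no buffer size to get wrong.
std::string int_to_text(int value)
{
    std::ostringstream os;
    os << value;
    return os.str();
}

// The same conversion with std::to_string (assumes C++11 or later).
std::string int_to_text_cxx11(int value)
{
    return std::to_string(value);
}
```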

By getting rid of the C stuff, we can rewrite the add function:

Code:

  void add(const std::string& word)
{
    ofstream file (filename, ios::out | ios::app | ios::binary);
    file.seekp(0, ios::end);
    auto length = word.size();
    file << length << std::ends;
    file.write (word.c_str(), length + 1);
}

(file.close() is not necessary; the destructor will do it for us.)

We can also rewrite search:

Code:

int search(const std::string& word)
{
    ifstream file (filename, ios::in | ios::binary);
    int length = 0;

    for (;;)
    {
	file >> length;
	if (file.eof()) return 0;
	std::vector<char> buf(length + 1);
	file.read(&buf[0], buf.size()); // A null terminator was written to file, as well
	if (file.eof()) return 0;
	std::string _Word(buf.begin(), buf.end());

        if (word == _Word)
	    return 1;
    }
    return 0;
}

Once again, file.close is not necessary.
Undoubtedly, this has bugs in it, but it is much cleaner and safer than your C code.
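For comparison, here is one possible way to tighten the add/search pair. It is a sketch under assumptions, not the code from this thread: the record layout (a 10-character, space-padded decimal length followed by exactly that many bytes, with no null terminators), the names `add_word` and `search_word`, and the 1024-character cap are all choices made for illustration. It works on any stream, so an `ofstream` or `ifstream` opened with `ios::binary` can be passed in, and every read is checked so the loop ends at end of file or on malformed data.

```cpp
#include <iomanip>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Assumed record layout: kLengthWidth characters of space-padded decimal
// length, then exactly that many bytes of the word.
const int kLengthWidth = 10;
const std::string::size_type kMaxWordLength = 1024;  // arbitrary sanity cap

// Appends one record; returns false if the word is too long or the write fails.
bool add_word(std::ostream& out, const std::string& word)
{
    if (word.size() > kMaxWordLength)
        return false;
    out << std::setw(kLengthWidth) << word.size();
    out.write(word.data(), static_cast<std::streamsize>(word.size()));
    return static_cast<bool>(out);
}

// Returns true if the word is stored in the stream. A failed or short read,
// a non-numeric length, or an oversized length all end the search.
bool search_word(std::istream& in, const std::string& word)
{
    char field[kLengthWidth];
    while (in.read(field, kLengthWidth)) {
        const std::string text(field, kLengthWidth);
        std::istringstream parse(text);
        std::string::size_type length = 0;
        if (!(parse >> length) || length > kMaxWordLength)
            return false;                 // malformed length field
        std::string stored(length, '\0');
        if (length > 0 &&
            !in.read(&stored[0], static_cast<std::streamsize>(length)))
            return false;                 // file ends in the middle of a record
        if (stored == word)
            return true;
    }
    return false;                         // reached the end without a match
}
```

Because the length field has a fixed width, the reader never has to guess where the number stops and the word starts.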

**Whyrusleeping** · 06-15-2011

Hey, thanks for all the help. but im a little lost on what some lines do (it works great, i just want to understand it):
in this line what does setw(10) do? also, i was warned by quite a few people about using << for binary files.

Code:

file << std::setw(10) << length << std::ends;

and im not sure what the vector is doing here:

Code:

std::vector<char> buf(length);

also, in your rewritten search function, what happens when the word being searched for isnt in the file? it looks like an infinite loop

**King Mir** · 06-15-2011

Originally Posted by Whyrusleeping

on the length of the input, i was meaning to cap it at 64, the database is meant to hold individual words, and i dont know any words that are 64 letters long...

The thing is, you don't want your program to crash when sent bad data. It doesn't need to report an error, although that would be nice, but it should never crash. So that's why I suggest using std::string.

Similarly for the file name.

**King Mir** · 06-15-2011

Originally Posted by Elysia

Code:

    char num[16];
    sprintf(num, "%d", m);

Dangerous and a ticking time bomb. The size of an int is implementation defined, and thus so is the length of its printed form.
Furthermore, should you change %d to something else, or reduce the size of num, you can find yourself with buffer overruns.

Of all the things wrong with his code, assuming that int is 4 bytes isn't one that counts. That's a pretty reasonable assumption.

**King Mir** · 06-15-2011

Originally Posted by Whyrusleeping

Hey, thanks for all the help. but im a little lost on what some lines do (it works great, i just want to understand it):
in this line what does setw(10) do? also, i was warned by quite a few people about using << for binary files.

Code:

file << std::setw(10) << length << std::ends;

You're not really writing a binary file. You're writing the number of characters lexically.

setw(10) sets the minimum width of the next field written to 10 characters; a longer value is written in full. By default it pads on the left with spaces, so the number 10 is written as "        10" (eight spaces, then 10). Ten characters is the most that a non-negative four-byte int would need. This is the same as printf("%10d",length).
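A quick way to see the padding for yourself is to format into a string stream. The helper name `padded_length` is made up for this sketch.

```cpp
#include <iomanip>
#include <sstream>
#include <string>

// Format a number as a right-aligned, space-padded field of at least
// `width` characters. setw applies to the next field only and never
// truncates a value that is wider than the field.
std::string padded_length(int length, int width)
{
    std::ostringstream os;
    os << std::setw(width) << length;
    return os.str();
}
```

With this helper, `padded_length(10, 10)` gives eight spaces followed by `10`, and `padded_length(12345, 3)` gives `12345` unchanged.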

and im not sure what the vector is doing here:

Code:

std::vector<char> buf(length);

Instead of a string or char array, Elysia chose to read the data into a char vector. I presume that this is because std::string data is not guaranteed to be contiguous in memory, if that is in fact the case (C++03 made no such guarantee; C++11 does). A vector is like an array, but safer.

also, in your rewritten search function, what happens when the word being searched for isnt in the file? it looks like an infinite loop

Indeed.

**whiteflags** · 06-15-2011

I'm a little floored that this has taken so long to get anywhere.

First decide how many bytes you want to write (see mask):

Code:

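for (unsigned int mask = 0xff000000, shift = CHAR_BIT * 3; mask > 0; mask >>= CHAR_BIT, shift -= CHAR_BIT) 
	myfile.put(static_cast<char>((len & mask) >> shift)); // shift the masked byte down before writing it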

You may want to pick a smaller byte mask for smaller numbers, or if you know you're working on a machine with a crippled CPU. (BTW, ostream::put, along with istream::get, is safe for binary files because they work with bytes only.)

Then when you read, you just fetch that many bytes again during the unserialize part. Now, when you open a strange binary file, its integers might be big-endian or little-endian, and only one of these matches the byte order on your machine. If the file's byte order differs from what your reading code assumes, you need a byte-swapping step after this part.

Code:

size_t len = 0;
vector<char> bytes(4, '\0');

myfile.read(&bytes[0], bytes.size());
// Go through unsigned char so bytes of 0x80 and above are not sign-extended.
for (size_t i = 0; i < bytes.size(); ++i)
	len = (len << CHAR_BIT) | static_cast<unsigned char>(bytes[i]);

If you're opening files you write on the host machine, endianness doesn't matter.

Dump the string.

Now you know about all you need to know about writing and reading strings and integers portably in binary, which will be about 90% of the data you put in those files. ID3v1 tags found in MP3 files, for example, fit into 128 bytes. It should be fairly obvious you don't need length information there, just grab the whole block. So planning your data format is essential. It may turn out with substantial abuse that certain fields aren't long enough but that's just the way it goes. You put out another version of the format to address the problem.