Thread: C reading/writing files

  1. #1
    Registered User
    Join Date
    Nov 2010
    Posts
    14

    C reading/writing files

    Hello,

    I am stuck on my C program, which lets users add files to an output (archive) file and extract them all again, i.e. so a user can back up files on their system.

    My program has two main commands: one lets the user specify files to add, and the other extracts all the files that were added.

    I am unsure about a few concepts. I can get the user's list of files to add from the 'argv[]' array. I can 'open()' each file specified, 'read()' the contents and 'write()' them to an output archive file specified by the user.

    The issues I am stuck on are:
    ) Obviously the size of the files the user wants to add will vary, and so will the type (i.e. .txt, .gif etc.), so I have declared 'char buffer[BUFSIZ];' (BUFSIZ comes from <stdio.h>) instead of hard-coding a value like buffer[?]

    ) using the 'read()' function, I am using the returned value of bytes read from there inside my 'write()' function, to copy the contents of that file into the output archive file (effectively adding files into this archive output file)

    Problem is - How can I structure each file added into the archive folder, so I can easily extract them?

    For example, I understand write() will write all files as binary, long string of bytes (0 & 1s) , inside this one output file. So how can I extract each file, using correct number of bytes and filenames, so they extract out back as individual original files when added?

    The code is too long to post here at the moment, but I hope someone can give me some ideas?

    Thanks

  2. #2
    Gawking at stupidity
    Join Date
    Jul 2004
    Location
    Oregon, USA
    Posts
    3,218
    Firstly, we appreciate your complete and well thought out question.

    To answer your question, I would suggest maybe storing the data in the header of the file. You could store the name and size of each file that's in the archive. Maybe with the length of the filename right before the actual string. Something like:
    [name length - 1 byte][name string][file size as 4 bytes]

    At the very front of the file you could store the number of files in the archive. Once you read in the header, it's a trivial matter to find the start of any file's actual contents by adding the file sizes of all the preceding files.
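
    The offset arithmetic described above can be sketched roughly as follows. This is only an illustration: the `entry` struct and the `compute_offsets` name are hypothetical, and the header length is passed in as a parameter because the exact width of the file count is not fixed in this post.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory form of one header entry. */
struct entry {
    const char *name;   /* file name as stored in the header */
    uint32_t    size;   /* size of the file's contents in bytes */
    uint32_t    offset; /* filled in: where the contents start in the archive */
};

/* The first file's data starts right after the header; each later file
 * starts where the previous one ends. */
void compute_offsets(struct entry *entries, size_t count, uint32_t header_len)
{
    uint32_t pos = header_len;
    for (size_t i = 0; i < count; ++i) {
        entries[i].offset = pos;
        pos += entries[i].size;
    }
}
```

    With a 30-byte header and files of 786 and 1024 bytes, the first file would start at offset 30 and the second at 816.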
    Last edited by itsme86; 11-30-2010 at 09:48 AM.
    If you understand what you're doing, you're not learning anything.

  3. #3
    Registered User
    Join Date
    Sep 2007
    Posts
    1,012
    Quote Originally Posted by daza166
    Problem is - How can I structure each file added into the archive folder, so I can easily extract them?
    Look at the tar file format for an idea. Try "man 5 tar".

    The basic idea is to put a header at the beginning of each file. It could be something very simple: one byte specifying the length of the filename (assuming filenames are no longer than 255 chars, the most a single byte can count), then the filename. Then 8 bytes specifying the size of the file, followed by the file. Tar (or Pax, the "new" tar) does things this way, although it has a lot more information for each entry.
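
    A minimal sketch of that per-file header, built into a memory buffer so it can be checked before being handed to write(). The function name `encode_entry_header` is hypothetical, and storing the 8-byte size least significant byte first is an assumption made here, not something tar does.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Layout assumed for this sketch:
 *   [1 byte: name length][name bytes, no terminator][8 bytes: file size]
 * Returns the number of bytes placed in out, or 0 if the name is too long
 * for a one-byte length or the buffer is too small. */
size_t encode_entry_header(unsigned char *out, size_t cap,
                           const char *name, uint64_t size)
{
    size_t namelen = strlen(name);
    if (namelen > 255)
        return 0;

    size_t total = 1 + namelen + 8;
    if (cap < total)
        return 0;

    out[0] = (unsigned char)namelen;
    memcpy(out + 1, name, namelen);

    /* Size in a fixed byte order (least significant byte first). */
    for (int i = 0; i < 8; ++i)
        out[1 + namelen + i] = (unsigned char)(size >> (8 * i));

    return total;
}
```

    The archiver would write the returned number of bytes to the archive and then copy the file's contents straight after them.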

    You could also create an index at the front of the file giving filenames, offsets, and sizes.

    I'd really recommend just using the tar format, though, since you seem to be doing archiving on Unix; why not stick with a standard?

  4. #4
    Registered User
    Join Date
    Nov 2010
    Posts
    14
    Thanks for the quick replies, guys. You have given me a few ideas. If I use a header file, would it be better to keep it in a separate file from the actual output file?

    To give you a better idea of what I have done so far, I have provided a bit of the code from my program. Am I heading in the right direction?
    Code:
    // variables defined here, including file descriptors
    char buffer[BUFSIZ]; // how big does this let the file be????

    for(i=2; i < argc; i++) // starts at argv[2], which should be a filename
    {
        file1=open(argv[i], O_RDWR, S_IRWXG);

        if(file1 < 0) { /* error */ }
        else // file is ok
        {
            while(!eof)
            {
                file1_status=read(file1,buffer,sizeof(buffer));

                if(file1_status==-1) { /* error */ }
                else // if file was read ok
                {
                    close(file1); // close read file
                    archiv=open("archivefile", O_APPEND | O_WRONLY | O_CREAT, S_IRWXU); // this opens or creates the archive file
                    if(archiv < 0) { /* error */ }
                    else // archive file is ready to write
                    {
                        write_to=write(archiv, buffer, sizeof(buffer)); // archiv is the file descriptor returned and buffer holds the contents copied
                        if(write_to<0) { /* error */ }
                        else { eof=1; /* ends while loop */ }
                    }
                }
            }
        }
    }
    The other issue is: how big is buffer[BUFSIZ] (BUFSIZ comes from <stdio.h>)? How can I make the buffer exactly as big as the file, rather than the default BUFSIZ size, so that it stops writing blank data to the file?

    Thanks

  5. #5
    Gawking at stupidity
    Join Date
    Jul 2004
    Location
    Oregon, USA
    Posts
    3,218
    Quote Originally Posted by daza166
    if I use a header file would it probably be better to have it in a separate file to the actual output file?
    I would definitely just add it to the actual output file so the user only needs to keep track of the one file.

    To write a partial buffer, write the number of bytes that read() read instead of the size of the buffer. read() returns the number of bytes that were read, so it should be stored in your file1_status variable when you get to the write() call.

    As far as the actual buffer size goes, I think you're on the right track using BUFSIZ. I wouldn't bother trying to make the buffer size match the file size.
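
    As a rough sketch of that advice, a copy loop can pass read()'s return value on to write(). The helper name `copy_fd` is hypothetical; it also retries short writes, which goes slightly beyond what is described above.

```c
#include <stdio.h>      /* BUFSIZ */
#include <unistd.h>     /* read, write */
#include <sys/types.h>  /* ssize_t */

/* Copy everything from descriptor in to descriptor out.
 * Returns the number of bytes copied, or -1 on a read or write error. */
long copy_fd(int in, int out)
{
    char buffer[BUFSIZ];
    long total = 0;
    ssize_t nread;

    while ((nread = read(in, buffer, sizeof buffer)) > 0) {
        ssize_t done = 0;
        /* Write only the nread bytes that were actually read,
         * never sizeof buffer. */
        while (done < nread) {
            ssize_t nwritten = write(out, buffer + done, (size_t)(nread - done));
            if (nwritten < 0)
                return -1;
            done += nwritten;
        }
        total += nread;
    }
    return nread < 0 ? -1 : total;
}
```

    The buffer stays at BUFSIZ bytes no matter how large the file is; a 786-byte file just produces one short read followed by a read that returns 0.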
    Last edited by itsme86; 11-30-2010 at 12:21 PM.

  6. #6
    Registered User
    Join Date
    Nov 2010
    Posts
    14
    Hello itsme86,

    Thanks for that useful info. To sum up everything down you have told me so far:

    1) I have changed it to ' write_to=write(archiv, buffer, file1_status); ' - so it only writes the contents of the buffer into the output file, up to the number of bytes returned by the read() function.

    However, would the buffer be default size of whatever [BUFSIZ] is or do i need ' write(archiv,buffer[file1_status], file1_status) ' instead, as say if there's 786 bytes in file, but buffer is default set 4087, there would be blank 0s(garbage) written to end of file?

    2) When I 'cat archivefile' in terminal in ubuntu, say I added one text file with 786 bytes of data into my archive file, which was returned from 'read()' as 786 bytes thats ok , but after it showed that content text there was load of unreadable hex garbage after it, should that happen or just show the actual 786 bytes of clear content text?

    3) With the header file, not sure exactly how to add header inside the actual output archive file? Below, is this what you mean by header data inside archive file?
    Say this is my output archive file, after I add 2 files to it:
    Code:
    output (archive) file ---
    
    Header data --
    file1 name-- file1.txt
    file1 size -- 786 bytes
    file2 name --file2.gif
    file2 size -- 1024 bytes
    ........." all header data 'name - size' is written, using for loop, before actual content written below
    
    File content data --
    file1textfile2textfile3text........................................... "long string of data would be in binary below "
    111111111111111111111111111111111111111111111111111111111111 ........... "would be binary
    Now I know to include header file #include "/path/something.h" inside main .c program but not in this case? as I am not including header file, theres only an archive file - (has no extension)?

    When extracting files back out of the archive file, I will extract all files, not just one, so I would start at the beginning of the content data and use 'lseek' to position the file pointer. Counting from the start of the content data, file1 starts at position 0, so if file1 is 786 bytes it occupies bytes 0 to 785 and file2 starts at position 786; adding 1024 to 786 gives 1810, where file3 starts, and so on. I would then just write the data content for each file using ' write(filename taken from header, permission stuff) ' with the correct filename?

    I understand the theory of what I am going to do, just cant write it using code correctly, can you show me skeleton of the code I need?

    Thanks, hope that all makes sense?

  7. #7
    Gawking at stupidity
    Join Date
    Jul 2004
    Location
    Oregon, USA
    Posts
    3,218
    This is probably a simplification of what I would do. I haven't tested the code or anything so there might be errors. There should obviously be error checking added as well and non-hardcoded filenames:
    Code:
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    
    unsigned int filesize(int fd)
    {
      struct stat st;
      fstat(fd, &st);
      return st.st_size;
    }
    
    int main(void)
    {
      struct fileinfo
      {
        char *name;
        unsigned int size;
      };
    
      struct fileinfo sourcefiles[] = { { "file1.bin", 0 }, { "file2.bin", 0 } };
      int numfiles = 2;
      int fdin, fdout;
      char buffer[BUFSIZ];
      int bytesread;
      int i;
    
      fdout = open("archive.blah", O_CREAT | O_WRONLY | O_TRUNC, S_IRUSR | S_IWUSR);
    
      // Reserve space for the header
      int headerlen = sizeof(short);
      for(i = 0;i < numfiles;++i)
        headerlen += 1 + strlen(sourcefiles[i].name) + sizeof(unsigned int);
      lseek(fdout, headerlen, SEEK_SET);
    
      // Write the contents of the files
      for(i = 0;i < numfiles;++i)
      {
        fdin = open(sourcefiles[i].name, O_RDONLY);
    
        sourcefiles[i].size = filesize(fdin);
    
        while((bytesread = read(fdin, buffer, sizeof(buffer))) > 0)
          write(fdout, buffer, bytesread);
    
        close(fdin);
      }
    
      // Go back and write the header
      lseek(fdout, 0, SEEK_SET);
      short count = (short)numfiles; // write a real short, not the first bytes of an int
      write(fdout, &count, sizeof(short));
      for(i = 0;i < numfiles;++i)
      {
        unsigned char namelen = (unsigned char)strlen(sourcefiles[i].name);
        write(fdout, &namelen, 1);
        write(fdout, sourcefiles[i].name, namelen);
        write(fdout, &sourcefiles[i].size, sizeof(unsigned int));
      }
    
      close(fdout);
    
      return 0;
    }
    There are some endian issues to work out, but you get the idea at least.
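
    One possible way to handle the endian issues is to store each number one byte at a time in a fixed order, so the archive reads back the same on any machine. The helper names `put_u32_le` and `get_u32_le` are hypothetical, and picking least-significant-byte-first is an arbitrary assumption for this sketch.

```c
#include <stdint.h>

/* Store a 32-bit value as 4 bytes, least significant byte first. */
void put_u32_le(unsigned char out[4], uint32_t value)
{
    out[0] = (unsigned char)(value & 0xFF);
    out[1] = (unsigned char)((value >> 8) & 0xFF);
    out[2] = (unsigned char)((value >> 16) & 0xFF);
    out[3] = (unsigned char)((value >> 24) & 0xFF);
}

/* Rebuild the value from 4 bytes written by put_u32_le. */
uint32_t get_u32_le(const unsigned char in[4])
{
    return (uint32_t)in[0]
         | ((uint32_t)in[1] << 8)
         | ((uint32_t)in[2] << 16)
         | ((uint32_t)in[3] << 24);
}
```

    The archiver would call put_u32_le on a 4-byte array and write() that array where the raw unsigned int is written now; the extractor would read() 4 bytes and pass them to get_u32_le.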
    Last edited by itsme86; 11-30-2010 at 03:04 PM.

  8. #8
    Registered User
    Join Date
    Nov 2010
    Posts
    14
    Thanks itsme86,

    That really helped me!!! I now have the program taking the user's input of files to add, with some error checking on them, and using read() and write() to add them into the archive file.

    I have got the header data going in as well. That should not be visible when displaying the archive file in the terminal, should it, but the content data should?

    OK, I just have a few questions on extracting files back out.

    To extract files from the archive:
    Would I need to open the archive file, read the header data and use lseek() to move to the positions where files start and end? How exactly would I do this?

    I understand '0' would be the start of the file, but how would I read the header data to find, say, where the file2 content data is located in the string, i.e. how many bytes do I need to lseek() to move the pointer to where each file starts?

    I want to extract all files from the archive, i.e. start with file1, then move on to file2 etc., and put them into the user's cwd (current working directory) as the original files.

    I would use open() and write() to re-create (extract) the files from the archive.

    Any ideas on how to code this would be very much appreciated.

    Thanks
    daza166

  9. #9
    Gawking at stupidity
    Join Date
    Jul 2004
    Location
    Oregon, USA
    Posts
    3,218
    This should get you started:
    Code:
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    
    struct fileinfo
    {
      char *name;
      unsigned int size;
      unsigned int offset;
    };
    
    void extract_file(int fdin, struct fileinfo file)
    {
      int fdout;
      char buffer[BUFSIZ];
      int bytesread;
      unsigned int bytesleft = file.size;
    
      fdout = open(file.name, O_CREAT | O_WRONLY | O_TRUNC, S_IRUSR | S_IWUSR);
    
      lseek(fdin, file.offset, SEEK_SET);
      while(bytesleft > 0 && (bytesread = read(fdin, buffer, sizeof(buffer) < bytesleft ? sizeof(buffer) : bytesleft)) > 0)
      {
        write(fdout, buffer, bytesread);
        bytesleft -= bytesread;
      }
    
      close(fdout);
    }
    
    int main(void)
    {
      int fdin;
      int numfiles;
      struct fileinfo *files;
      int i;
    
      fdin = open("archive.blah", O_RDONLY);
      short count = 0; // the archiver stored the count as a short
      read(fdin, &count, sizeof(short));
      numfiles = count;
    
      files = malloc(sizeof(*files) * numfiles);
    
      for(i = 0;i < numfiles;++i)
      {
        unsigned char namelen;
        read(fdin, &namelen, 1);
        files[i].name = malloc(namelen + 1);
        read(fdin, files[i].name, namelen);
        files[i].name[namelen] = '\0';
        read(fdin, &files[i].size, sizeof(unsigned int));
      }
    
      int headerlen = sizeof(short);
      for(i = 0;i < numfiles;++i)
        headerlen += 1 + strlen(files[i].name) + sizeof(unsigned int);
      if(numfiles > 0)
        files[0].offset = headerlen;
      for(i = 1;i < numfiles;++i)
          files[i].offset = files[i - 1].offset + files[i - 1].size;
    
      for(i = 0;i < numfiles;++i)
        extract_file(fdin, files[i]);
    
      close(fdin);
    
      for(i = 0;i < numfiles;++i)
        free(files[i].name);
      free(files);
    
      return 0;
    }
    That code should let you extract any file you want, not just the whole thing.

    Also, you can see the trade-off of not storing the file offsets in the header: they have to be calculated manually during the extraction process.

  10. #10
    Banned
    Join Date
    Aug 2010
    Location
    Ontario Canada
    Posts
    9,547
    It would be better to treat the file as a doubly linked list using a fixed size header between files.

    Code:
    #pragma pack(1)
    
      struct fileinfo
      {
        int8_t  Name[128];
        uint32_t Size;
        uint32_t Previous;
        uint32_t Next;
      };
    
    #pragma pack()
    Previous and Next tell you where to seek to find the respective headers. Instead of memory pointers... you do it by manipulating the file pointer.

  11. #11
    Gawking at stupidity
    Join Date
    Jul 2004
    Location
    Oregon, USA
    Posts
    3,218
    Quote Originally Posted by CommonTater View Post
    It would be better to treat the file as a doubly linked list using a fixed size header between files.

    Code:
    #pragma pack(1)
    
      struct fileinfo
      {
        int8_t  Name[128];
        uint32_t Size;
        uint32_t Previous;
        uint32_t Next;
      };
    
    #pragma pack()
    Previous and Next tell you where to seek to find the respective headers. Instead of memory pointers... you do it by manipulating the file pointer.
    "Better" is debatable. Your method would require a larger header. If size is important then my method is better. Besides, the calculation really isn't difficult. The bottleneck is going to be disk I/O.

  12. #12
    Banned
    Join Date
    Aug 2010
    Location
    Ontario Canada
    Posts
    9,547
    Quote Originally Posted by itsme86 View Post
    "Better" is debatable. Your method would require a larger header. If size is important then my method is better. Besides, the calculation really isn't difficult. The bottleneck is going to be disk I/O.
    The way I showed is how you overcome the bottleneck.

    To write a struct into a file you need it opened in binary mode, so you would look at the first struct... seek directly to header->next and look at it, without having to read any of the intervening file content. Since it's doubly linked you can move backwards in the file without having to rewind and re-read. I've implemented this in the past (in Pascal) for text archiving and it's way faster than having to read the entire archive to get to the content you want... especially if the record you are seeking is the last one in a 3gb archive.
    Last edited by CommonTater; 12-01-2010 at 01:47 PM.

  13. #13
    Registered User
    Join Date
    Nov 2010
    Posts
    14

    itsme86,

    just a little bit confused with extract code you provided, perhaps comments?.

    Is it using structure data from when the files are read in? The problem is that when the user runs the extract command, the program would be restarted and there would be no values in the structure array for the files written into the archive.

    So, that code would not be valid? Would have to somehow search archive file for data but store in program where offsets are.

    is that code part of the original just amended or separate?

    At moment just want to be able to extract all files out, not individual ones.

    thanks
    daza166
    Last edited by daza166; 12-01-2010 at 02:27 PM.

  14. #14
    Gawking at stupidity
    Join Date
    Jul 2004
    Location
    Oregon, USA
    Posts
    3,218
    Quote Originally Posted by CommonTater View Post
    The way I showed is how you overcome the bottleneck.

    To write a struct into a file you need it opened in binary mode, so you would look at the first struct... seek directly to header->next and look at it, without having to read any of the intervening file content. Since it's doubly linked you can move backwards in the file without having to rewind and re-read. I've implemented this in the past (in Pascal) for text archiving and it's way faster than having to read the entire archive to get to the content you want... especially if the record you are seeking is the last one in a 3gb archive.
    My method doesn't read the whole archive. You can jump directly to any stored file you want with my method. In fact, you only need to read the header one time and then you have random access to the metadata for any file stored in the archive that you want without having to walk through a list. All the lseek()'ing to the previous and next pointers is slow. I still believe my method is faster than yours. Yours definitely doesn't overcome the disk I/O bottleneck more than mine.

    Quote Originally Posted by daza166
    itsme86,

    just a little bit confused with extract code you provided, perhaps comments?.

    Is it using structure data from when the files are read in? The problem is that when the user runs the extract command, the program would be restarted and there would be no values in the structure array for the files written into the archive.

    So, that code would not be valid? Would have to somehow search archive file for data but store in program where offsets are.

    is that code part of the original just amended or separate?

    At moment just want to be able to extract all files out, not individual ones.

    thanks
    daza166
    This is a completely separate program. So you'd have the first one as the archiver, and this one as the unarchiver. Combining them would be an exercise left up to you.

  15. #15
    Registered User
    Join Date
    Nov 2010
    Posts
    14
    itsme86

    OK, so they are two programs. With the extract program, would it not be easier to just extract all files, instead of one?

    The first two bytes of the archive file would be a number like 13, which is the number of files inside the archive. So, if I extract the first 2 bytes, I have the number of files for my 'numfiles' variable.

    Is there a simple way of reading the header data to find the name and size of file1 and how many bytes into the archive it starts, and the same for all the rest of the files, and then populate the structure from that data? Then I could use a for loop over the files, take filename[i].offset for lseek() and filename[i].size for the number of bytes to read(), something like (read(archivefile, filename[i].offset, filename[i].size) ?

    Does that make sense?
