Thread: Copying Sparse Files

  1. #1
    Registered User
    Join Date
    Sep 2014
    Posts
    6

    Copying Sparse Files

    Hello all,

    I am toying around with a sample application that copies sparse files, but I haven't figured it out. I have tried to use read/write, but they seem to create dense files no matter what I try. My latest attempt is using fseek and fwrite/fread - what am I doing wrong?

    Code:
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <error.h>
    #include <string.h>
    #include <unistd.h>
    #include <ctype.h>
    #include <stdint.h>
    
    
    #include <assert.h>
    
    
    #define BUF_SIZE 8192        // 8k ;)
    
    
    static off_t fsize(const char *filename);
    
    
    /**
     * fsize(const char *filename)
     * 
     * @brief Returns the filesize of a file using stat.h
     * @param filename
     * @return -1 if there is an error, zero or positive value otherwise
     */
    static off_t fsize(const char *filename)
    {
        struct stat st = { 0 };
    
    
        if (stat(filename, &st) == 0) {
    
    
            return st.st_size;
        }
    
    
        return (-1);
    }
    
    
    int main(int argc, char **argv)
    {
        FILE *inputFD = NULL;
        FILE *outputFD = NULL;
        char buf[8192] = { };
        ssize_t file_size = 0;
        ssize_t apparent_size = 0;
    
    
        /* Open a file descriptor for the input file */
        if ((inputFD = fopen(argv[1], "r")) == NULL) {
            perror("Error opening input file descriptor");
            return (-1);
        }
    
    
        /* Determine the file_size of the input file */
        if ((apparent_size = fsize(argv[1])) < 0) {
            perror("Unable to determine accurate size of file");
            return (-1);
        }
        printf("file's apparent size:%i\n",(unsigned int)apparent_size);
    
    
        /* Open a file descriptor for the output file */
        if ((outputFD = fopen(argv[2], "w")) == NULL) {
            perror("Error opening output file descriptor");
            return (-1);
        }
    
    
        //~ /* Lets advise the kernel that we are going to attempt a long
         //~ * sequential write */
        //~ if (posix_fadvise(inputFD, 0, 0, POSIX_FADV_SEQUENTIAL) < 0) {
            //~ perror
                //~ ("Unable to advise the kernel for the upcoming sequential write");
            //~ // Continue anyways...
        //~ }
    
    
        /* Read from the input file descriptor and write to the output file descriptor
         * while keeping the 8k within the CPUs L1 cache */
    
    
        int byteCount = 0;
        while ((file_size = fread(buf, sizeof(char), sizeof(buf),inputFD)) > 0) {
            
            int i = 0;
            int nullBytes = 0;
            
            for (i = 0; i < file_size; i++) {
                byteCount ++;
                if ((file_size > 0) && (buf[i] == '\0')) {
                    //~ if (buf[i + 1] == '\0') {
                        
                        fseek(outputFD,1,SEEK_SET);
                        
                            //printf("buff:%s pos:%i fs:%i\n", buf, i, file_size);
                        //printf("file hole detected\n");
                        //i++;
                    //~ }
                } else {
                    printf("writing %i\n",i);
                    fwrite(buf, sizeof(char), sizeof(buf),outputFD);
                    nullBytes = 0;
                }
                
                //~ if(byteCount == apparent_size) {
                    //~ continue;
                //~ }
    
    
            }
    
    
        }
    
    
        /* Close the input file descriptor */
        if (fclose(inputFD) == -1) {
            perror("closing inputFD");
            return (-1);
        }
    
    
        /* Close the output file descriptor */
        if (fclose(outputFD) == -1) {
            perror("closing outputFD");
            return (-1);
        }
    
    
        return 0;
    }

  2. #2
    Registered User
    Join Date
    Sep 2007
    Posts
    1,012
    The first problem is that, when you encounter a zero, you're seeking 1 byte from the beginning of the file. Presumably you want to seek 1 byte from the current position, so use SEEK_CUR instead of SEEK_SET.

    Also, seeking by itself will not extend a file. You have to write something to the file after all your seeking (that something could be a zero, of course). That means that you'll probably wind up with at least one block of data on disk, so make sure you're testing with a large enough file to notice the holes. If you were so inclined, however, you could look at the ftruncate() function as well.
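    A minimal sketch of both points, using plain file descriptors rather than stdio to keep it short. The helper names make_file_with_hole and make_empty_sparse_file are hypothetical, and whether the skipped range really becomes a hole depends on the filesystem.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: creates `path` with `hole` skipped bytes followed by
 * one data byte. Returns 0 on success, -1 on error. */
static int make_file_with_hole(const char *path, off_t hole)
{
    assert(hole >= 0);

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    /* Seek relative to the current position (SEEK_CUR), not the start. */
    if (lseek(fd, hole, SEEK_CUR) == (off_t)-1) {
        close(fd);
        return -1;
    }

    /* Seeking alone does not extend the file; this write does. */
    if (write(fd, "x", 1) != 1) {
        close(fd);
        return -1;
    }
    return close(fd);
}

/* Hypothetical helper: sets the size with ftruncate() instead, so no data
 * has to be written at all. Returns 0 on success, -1 on error. */
static int make_empty_sparse_file(const char *path, off_t size)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, size) < 0) {
        close(fd);
        return -1;
    }
    return close(fd);
}
```

    Calling make_file_with_hole("out.img", 1 << 20) should give a file whose apparent size is 1 MiB plus one byte, which can be compared with du and du --apparent-size.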

  3. #3
    Registered User
    Join Date
    Sep 2014
    Posts
    6
    Quote: Originally posted by cas
    The first problem is that, when you encounter a zero, you're seeking 1 byte from the beginning of the file. Presumably you want to seek 1 byte from the current position, so use SEEK_CUR instead of SEEK_SET.

    Also, seeking by itself will not extend a file. You have to write something to the file after all your seeking (that something could be a zero, of course). That means that you'll probably wind up with at least one block of data on disk, so make sure you're testing with a large enough file to notice the holes. If you were so inclined, however, you could look at the ftruncate() function as well.
    Okay - so I have made your changes, and I think you are right that at least one byte is written to disk. Below is the updated code.

    As for the input, here is how I create my sparse file:

    Code:
    dev@optimus:~/dev$ dd if=/dev/zero of=sparse_file.img bs=1 count=0 seek=128M
    0+0 records in
    0+0 records out
    0 bytes (0 B) copied, 0.000530044 s, 0.0 kB/s
    
    
    dev@optimus:~/dev$ du -h  sparse_file.img ; du -h --apparent-size sparse_file.img 
    0    sparse_file.img
    128M sparse_file.img
    Then after executing my application, I see:

    Code:
    du -h  out; du -h --apparent-size out
    0 out
    128M    out
    Looks like it works!

    Code:
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/stat.h>
    #include <sys/types.h>
    #include <fcntl.h>
    #include <error.h>
    #include <string.h>
    #include <unistd.h>
    #include <ctype.h>
    #include <stdint.h>
    
    
    #include <assert.h>
    
    
    #define BUF_SIZE 8192        // 8k ;)
    
    
    static off_t fsize(const char *filename);
    
    
    /**
     * fsize(const char *filename)
     * 
     * @brief Returns the filesize of a file using stat.h
     * @param filename
     * @return -1 if there is an error, zero or positive value otherwise
     */
    static off_t fsize(const char *filename)
    {
        struct stat st = { 0 };
    
    
        if (stat(filename, &st) == 0) {
    
    
            return st.st_size;
        }
    
    
        return (-1);
    }
    
    
    /**
     * main(int argc, char **argv)
     * 
     * @brief A main function
     * @param argc
     * @param argv
     * @return -1 if error, 0 for success
     */
    int main(int argc, char **argv)
    {
        FILE *inputFP = NULL;
        FILE *outputFP = NULL;
        char buf[8192] = { };
        ssize_t file_size = 0;
        ssize_t apparent_size = 0;
    
    
        /* Open a file descriptor for the input file */
        if ((inputFP = fopen(argv[1], "r")) == NULL) {
            perror("Error opening input file descriptor");
            return (-1);
        }
    
    
        /* Determine the file_size of the input file */
        if ((apparent_size = fsize(argv[1])) < 0) {
            perror("Unable to determine accurate size of file");
            return (-1);
        }
        printf("file's apparent size:%i\n", (unsigned int)apparent_size);
    
    
        /* Open a file descriptor for the output file */
        if ((outputFP = fopen(argv[2], "w")) == NULL) {
            perror("Error opening output file descriptor");
            return (-1);
        }
    
    
        /* Read from the input file descriptor and write to the output file descriptor
         * while keeping the 8k within the CPUs L1 cache */
        while ((file_size = fread(buf, sizeof(char), sizeof(buf), inputFP)) > 0) {
    
    
            int i = 0;
            for (i = 0; i < file_size; i++) {
    
    
                if ((file_size > 0) && (buf[i] == '\0')
                    && (buf[i + 1] == '\0')) {
    
    
                    if (ftruncate(fileno(outputFP), i + 1) < 0) {
                        perror("ftruncating file hole\n");
                        return (-1);
                    }
    
    
                } else {
    
    
                    fwrite(&buf[i], sizeof(char), sizeof(char),
                           outputFP);
                    clearerr(outputFP);
                    if (ferror(outputFP)) {
                        perror("write\n");
                        break;
                    }
                }
    
    
            }
    
    
        }
    
    
        /* Close the input file descriptor */
        if (fclose(inputFP) == -1) {
            perror("closing inputFP");
            return (-1);
        }
    
    
        /* Close the output file descriptor */
        if (fclose(outputFP) == -1) {
            perror("closing outputFP");
            return (-1);
        }
    
    
        return 0;
    }

  4. #4
    Salem
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,666
    > if ((file_size > 0) && (buf[i] == '\0')
    > && (buf[i + 1] == '\0'))
    This is an out of bounds access on the last iteration of the loop (if the buffer is full), or accessing garbage data in the buffer.
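    One way to keep the lookahead in bounds is to pass in the number of bytes fread() actually returned and refuse to look past it. A minimal sketch; the helper name two_zeros_at is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if buf[i] and buf[i + 1] are both zero,
 * looking only at the `n` bytes that fread() actually filled in.
 * Returns 0 whenever i + 1 would fall outside those n bytes. */
static int two_zeros_at(const char *buf, size_t n, size_t i)
{
    assert(buf != NULL);
    /* n >= 2 && i <= n - 2 is the overflow-safe form of i + 1 < n. */
    return n >= 2 && i <= n - 2 && buf[i] == '\0' && buf[i + 1] == '\0';
}
```

    In the loop above, the test would then read two_zeros_at(buf, (size_t)file_size, (size_t)i), so the last byte of a full buffer never triggers a read of buf[8192].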
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  5. #5
    Registered User
    Join Date
    Sep 2014
    Posts
    6
    Quote: Originally posted by Salem
    > if ((file_size > 0) && (buf[i] == '\0')
    > && (buf[i + 1] == '\0'))
    This is an out of bounds access on the last iteration of the loop (if the buffer is full), or accessing garbage data in the buffer.
    Thanks Salem! Good catch, I'll update my code block in a bit later today

  6. #6
    Salem
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,666
    More oddness.

    > clearerr(outputFP);
    > if (ferror(outputFP))
    Calling clearerr() before ferror() makes no sense.

    > if (fclose(inputFP) == -1)
    You should compare with EOF, not -1
    The standard just requires EOF to be negative.
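    A minimal sketch of the order these checks could go in. The helper names write_all and close_stream are hypothetical; the point is only that ferror() is consulted before clearerr(), and that fclose() is compared against EOF.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: writes `n` bytes to `out`.
 * Returns 0 on success, -1 on a short write. */
static int write_all(FILE *out, const char *data, size_t n)
{
    assert(out != NULL);
    if (fwrite(data, 1, n, out) != n) {
        /* Inspect the error indicator first; clear it only afterwards,
         * and only if the stream is going to be used again. */
        if (ferror(out))
            perror("fwrite");
        clearerr(out);
        return -1;
    }
    return 0;
}

/* Hypothetical helper: fclose() returns EOF on failure, so compare
 * against EOF rather than a literal -1. */
static int close_stream(FILE *fp, const char *what)
{
    assert(fp != NULL);
    if (fclose(fp) == EOF) {
        perror(what);
        return -1;
    }
    return 0;
}
```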

  7. #7
    Registered User
    Join Date
    Sep 2014
    Posts
    6
    Better?

    Code:
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/stat.h>
    #include <sys/types.h>
    #include <fcntl.h>
    #include <error.h>
    #include <string.h>
    #include <unistd.h>
    #include <ctype.h>
    #include <stdint.h>
    
    
    #include <assert.h>
    
    
    #define BUF_SIZE 8192		// 8k ;)
    
    
    static off_t fsize(const char *filename);
    
    
    /**
     * fsize(const char *filename)
     * 
     * @brief Returns the filesize of a file using stat.h
     * @param filename
     * @return -1 if there is an error, zero or positive value otherwise
     */
    static off_t fsize(const char *filename)
    {
    	struct stat st = { 0 };
    
    
    	if (stat(filename, &st) == 0) {
    
    
    		return st.st_size;
    	}
    
    
    	return (-1);
    }
    
    
    /**
     * main(int argc, char **argv)
     * 
     * @brief A main function
     * @param argc
     * @param argv
     * @return -1 if error, 0 for success
     */
    int main(int argc, char **argv)
    {
    	FILE *inputFP = NULL;
    	FILE *outputFP = NULL;
    	char buf[8192] = { };
    	ssize_t file_size = 0;
    	ssize_t apparent_size = 0;
    
    
    	/* Open a file descriptor for the input file */
    	if ((inputFP = fopen(argv[1], "r")) == NULL) {
    		perror("Error opening input file descriptor");
    		return (-1);
    	}
    
    
    	/* Determine the file_size of the input file */
    	if ((apparent_size = fsize(argv[1])) < 0) {
    		perror("Unable to determine accurate size of file");
    		return (-1);
    	}
    	printf("file's apparent size:%i\n", (unsigned int)apparent_size);
    
    
    	/* Open a file descriptor for the output file */
    	if ((outputFP = fopen(argv[2], "w")) == NULL) {
    		perror("Error opening output file descriptor");
    		return (-1);
    	}
    
    
    	/* Lets advise the kernel that we are going to attempt a long
    	 * sequential write */
    	if (posix_fadvise(fileno(inputFP), 0, 0, POSIX_FADV_SEQUENTIAL) < 0) {
    		perror("Unable to advise the kernel for the upcoming sequential write");
    		// Continue anyways...
    	}
    
    
    	/* Create a file using ftruncate and lets carry on */
    	if (ftruncate(fileno(outputFP), apparent_size) < 0) {
    		perror("Unable to create a file that will have a similar size to the original");
    		return (-1);
    	}
    
    
    	/* Read from the input file descriptor and write to the output file descriptor
    	 * while keeping the 8k within the CPUs L1 cache */
    
    
    	int i = 0;
    	unsigned long holeSize = 0;
    	while ((file_size = fread(buf, sizeof(char), sizeof(buf), inputFP)) > 0) {
    
    
    		for (i = 0; i < file_size; i++) {
    
    
    			if (buf[i] == '\0') {
    				holeSize++;
    			} else if (holeSize > 0) {
    				fseek(outputFP, holeSize, SEEK_CUR);
    				fwrite(&buf[i], 1, 1, outputFP);
    				clearerr(outputFP);
    				if (ferror(outputFP)) {
    					perror("write\n");
    					break;
    				}
    				holeSize = 0;
    			} else {
    
    
    				fwrite(&buf[i], sizeof(char), sizeof(char), outputFP);
    				
    				if (ferror(outputFP)) {
    					perror("write\n");
    					break;
    				}
    			}
    
    
    		}
    
    
    	}
    
    
    	/* Close the input file descriptor */
    	if (fclose(inputFP) < 0) {
    		perror("closing inputFP");
    		return (-1);
    	}
    
    
    	/* Close the output file descriptor */
    	if (fclose(outputFP) < 0) {
    		perror("closing outputFP");
    		return (-1);
    	}
    
    
    	return 0;
    }

  8. #8
    Salem
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,666
    You're still calling clearerr() before ferror() in one instance.

    Code:
        /* Read from the input file descriptor and write to the output file descriptor
         * while keeping the 8k within the CPUs L1 cache */
    Your reason for choosing your buffer size is bogus.
    The whole of buf is written to by the fread() call.
    Each element of buf is read once (if zero), or twice (if non-zero).
    You have other data as well (the rest of the local stack frame for instance), not to mention the effects of
    - calling other functions
    - traps into the OS to physically read/write the disk
    - the OS itself forcing context switches to other processes.

    The elephants in the room are all those fread / fwrite / fseek calls.
    Memory access times are measured in nanoseconds.
    Hard disk head seek times are measured in milliseconds (that's 1M times slower).

    Why bother worrying about whether something takes 1 or 2 seconds when you know there is a delay of a fortnight coming up real soon?

    > char buf[8192]
    You don't even use your #define value.
    IMO, you would be better off starting with BUFSIZ, which is a constant in stdio.h, and is the optimal size for file operations, as determined by the implementers of your standard C library.
    char buf[BUFSIZ];


    Then there is the 'time' command.
    time ./copysparse infile outfile
    Depending on your system, this should print out how much time (real time, user time and system time) the process spent performing the given task.

    Also try using the 'top' command (in another terminal window). It shows you the currently active processes.
    You might be surprised by how little CPU time your copy program takes.
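    To cut down on the per-byte fwrite / fseek calls, the copy can work one BUFSIZ chunk at a time: write chunks that contain data, seek over chunks that are entirely zero, and set the final size at the end. This is a sketch under those assumptions, not the code from the thread; copy_sparse and all_zero are hypothetical names, and fseeko() is the POSIX variant of fseek() that takes an off_t.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns 1 if the first n bytes of buf are all zero, 0 otherwise. */
static int all_zero(const char *buf, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (buf[i] != '\0')
            return 0;
    return 1;
}

/* Hypothetical helper: copies `in` to `out`, seeking over chunks that are
 * entirely zero so the filesystem can leave them as holes.
 * Returns 0 on success, -1 on error. */
static int copy_sparse(FILE *in, FILE *out)
{
    char buf[BUFSIZ];
    size_t n;
    off_t total = 0;

    assert(in != NULL && out != NULL);

    while ((n = fread(buf, 1, sizeof buf, in)) > 0) {
        if (all_zero(buf, n)) {
            /* One seek per zero chunk, relative to the current position. */
            if (fseeko(out, (off_t)n, SEEK_CUR) != 0)
                return -1;
        } else if (fwrite(buf, 1, n, out) != n) {
            return -1;
        }
        total += (off_t)n;
    }
    if (ferror(in))
        return -1;

    /* A trailing seek does not extend the file, so flush the stream and
     * then set the final size explicitly. */
    if (fflush(out) == EOF)
        return -1;
    return ftruncate(fileno(out), total);
}
```

    Zero runs shorter than a chunk are simply written out as data here, which keeps the loop simple at the cost of some missed holes.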

  9. #9
    Registered User
    Join Date
    Sep 2014
    Posts
    6
    Hi Salem,

    Excellent tips - I really appreciate it. Yes, I understand that I/O will be slow, especially without a RAM disk or SSD. mmap could be an alternative too.

    Should I also use something like posix_fallocate() to verify that there is enough space on disk to create the duplicate file (with the apparent size)?

    Re: posix_fallocate(3) - Linux manual page
