Copying Sparse Files

**invalidCRC** · 09-22-2014

Hello all,

I am toying around with a sample application that copies sparse files, but I haven't figured it out. I have tried to use read/write, but they seem to create dense files no matter what I try. My latest attempt is using fseek and fwrite/fread - what am I doing wrong?

Code:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <error.h>
#include <string.h>
#include <unistd.h>
#include <ctype.h>
#include <stdint.h>


#include <assert.h>


#define BUF_SIZE 8192        // 8k ;)


static off_t fsize(const char *filename);


/**
 * fsize(const char *filename)
 * 
 * @brief Returns the filesize of a file using stat.h
 * @param filename
 * @return -1 if there is an error, zero or positive value otherwise
 */
static off_t fsize(const char *filename)
{
    struct stat st = { 0 };


    if (stat(filename, &st) == 0) {


        return st.st_size;
    }


    return (-1);
}


int main(int argc, char **argv)
{
    FILE *inputFD = NULL;
    FILE *outputFD = NULL;
    char buf[8192] = { };
    ssize_t file_size = 0;
    ssize_t apparent_size = 0;


    /* Open a file descriptor for the input file */
    if ((inputFD = fopen(argv[1], "r")) == NULL) {
        perror("Error opening input file descriptor");
        return (-1);
    }


    /* Determine the file_size of the input file */
    if ((apparent_size = fsize(argv[1])) < 0) {
        perror("Unable to determine accurate size of file");
        return (-1);
    }
    printf("file's apparent size:%i\n",(unsigned int)apparent_size);


    /* Open a file descriptor for the output file */
    if ((outputFD = fopen(argv[2], "w")) == NULL) {
        perror("Error opening output file descriptor");
        return (-1);
    }


    //~ /* Lets advise the kernel that we are going to attempt a long
     //~ * sequential write */
    //~ if (posix_fadvise(inputFD, 0, 0, POSIX_FADV_SEQUENTIAL) < 0) {
        //~ perror
            //~ ("Unable to advise the kernel for the upcoming sequential write");
        //~ // Continue anyways...
    //~ }


    /* Read from the input file descriptor and write to the output file descriptor
     * while keeping the 8k within the CPUs L1 cache */


    int byteCount = 0;
    while ((file_size = fread(buf, sizeof(char), sizeof(buf),inputFD)) > 0) {
        
        int i = 0;
        int nullBytes = 0;
        
        for (i = 0; i < file_size; i++) {
            byteCount ++;
            if ((file_size > 0) && (buf[i] == '\0')) {
                //~ if (buf[i + 1] == '\0') {
                    
                    fseek(outputFD,1,SEEK_SET);
                    
                        //printf("buff:%s pos:%i fs:%i\n", buf, i, file_size);
                    //printf("file hole detected\n");
                    //i++;
                //~ }
            } else {
                printf("writing %i\n",i);
                fwrite(buf, sizeof(char), sizeof(buf),outputFD);
                nullBytes = 0;
            }
            
            //~ if(byteCount == apparent_size) {
                //~ continue;
            //~ }


        }


    }


    /* Close the input file descriptor */
    if (fclose(inputFD) == -1) {
        perror("closing inputFD");
        return (-1);
    }


    /* Close the output file descriptor */
    if (fclose(outputFD) == -1) {
        perror("closing outputFD");
        return (-1);
    }


    return 0;
}

**cas** · 09-22-2014

The first problem is that, when you encounter a zero, you're seeking 1 byte from the beginning of the file. Presumably you want to seek 1 byte from the current position, so use SEEK_CUR instead of SEEK_SET.

Also, seeking by itself will not extend a file. You have to write something to the file after all your seeking (that something could be a zero, of course). That means that you'll probably wind up with at least one block of data on disk, so make sure you're testing with a large enough file to notice the holes. If you were so inclined, however, you could look at the ftruncate() function as well.
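
The two points above can be sketched as follows. This is only an illustration under stated assumptions: it uses the POSIX descriptor calls (lseek/write/ftruncate) rather than stdio's fseek/fwrite, and the helper names `seek_then_size` and `truncate_to_size` are hypothetical, not from the posted program.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: create `path`, seek `hole` bytes forward from the
 * current offset and optionally write a single '\0' byte afterwards.
 * Returns the resulting file size, or -1 on error. */
static off_t seek_then_size(const char *path, off_t hole, int write_byte)
{
    assert(path != NULL && hole >= 0);
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    struct stat st;
    off_t size = -1;
    if (lseek(fd, hole, SEEK_CUR) == hole            /* relative seek */
        && (!write_byte || write(fd, "", 1) == 1)    /* "" holds one '\0' */
        && fstat(fd, &st) == 0)
        size = st.st_size;

    close(fd);
    return size;
}

/* Hypothetical helper: set the file length directly with ftruncate();
 * no data bytes are written. Returns the resulting size, or -1. */
static off_t truncate_to_size(const char *path, off_t length)
{
    assert(path != NULL && length >= 0);
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    struct stat st;
    off_t size = -1;
    if (ftruncate(fd, length) == 0 && fstat(fd, &st) == 0)
        size = st.st_size;

    close(fd);
    return size;
}
```

Seeking alone leaves the size at 0, seeking and then writing one byte gives `hole + 1`, and ftruncate() gives exactly the requested length.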

**invalidCRC** · 09-23-2014

Okay - so I have made your changes and I think you are right that at least one byte is written to disk. Below is the updated code.

As for input, here is how I create my sparse file:

Code:

dev@optimus:~/dev$ dd if=/dev/zero of=sparse_file.img bs=1 count=0 seek=128M
0+0 records in
0+0 records out
0 bytes (0 B) copied, 0.000530044 s, 0.0 kB/s


dev@optimus:~/dev$ du -h  sparse_file.img ; du -h --apparent-size sparse_file.img 
0    sparse_file.img
128M sparse_file.img

Then after executing my application, I see:

Code:

du -h  out; du -h --apparent-size out
0 out
128M    out

Looks like it works!

Code:

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <fcntl.h>
#include <error.h>
#include <string.h>
#include <unistd.h>
#include <ctype.h>
#include <stdint.h>


#include <assert.h>


#define BUF_SIZE 8192        // 8k ;)


static off_t fsize(const char *filename);


/**
 * fsize(const char *filename)
 * 
 * @brief Returns the filesize of a file using stat.h
 * @param filename
 * @return -1 if there is an error, zero or positive value otherwise
 */
static off_t fsize(const char *filename)
{
    struct stat st = { 0 };


    if (stat(filename, &st) == 0) {


        return st.st_size;
    }


    return (-1);
}


/**
 * main(int argc, char **argv)
 * 
 * @brief A main function
 * @param argc
 * @param argv
 * @return -1 if error, 0 for success
 */
int main(int argc, char **argv)
{
    FILE *inputFP = NULL;
    FILE *outputFP = NULL;
    char buf[8192] = { };
    ssize_t file_size = 0;
    ssize_t apparent_size = 0;


    /* Open a file descriptor for the input file */
    if ((inputFP = fopen(argv[1], "r")) == NULL) {
        perror("Error opening input file descriptor");
        return (-1);
    }


    /* Determine the file_size of the input file */
    if ((apparent_size = fsize(argv[1])) < 0) {
        perror("Unable to determine accurate size of file");
        return (-1);
    }
    printf("file's apparent size:%i\n", (unsigned int)apparent_size);


    /* Open a file descriptor for the output file */
    if ((outputFP = fopen(argv[2], "w")) == NULL) {
        perror("Error opening output file descriptor");
        return (-1);
    }


    /* Read from the input file descriptor and write to the output file descriptor
     * while keeping the 8k within the CPUs L1 cache */
    while ((file_size = fread(buf, sizeof(char), sizeof(buf), inputFP)) > 0) {


        int i = 0;
        for (i = 0; i < file_size; i++) {


            if ((file_size > 0) && (buf[i] == '\0')
                && (buf[i + 1] == '\0')) {


                if (ftruncate(fileno(outputFP), i + 1) < 0) {
                    perror("ftruncating file hole\n");
                    return (-1);
                }


            } else {


                fwrite(&buf[i], sizeof(char), sizeof(char),
                       outputFP);
                clearerr(outputFP);
                if (ferror(outputFP)) {
                    perror("write\n");
                    break;
                }
            }


        }


    }


    /* Close the input file descriptor */
    if (fclose(inputFP) == -1) {
        perror("closing inputFP");
        return (-1);
    }


    /* Close the output file descriptor */
    if (fclose(outputFP) == -1) {
        perror("closing outputFP");
        return (-1);
    }


    return 0;
}

**Salem** · 09-23-2014

> if ((file_size > 0) && (buf[i] == '\0')
> && (buf[i + 1] == '\0'))
This is an out of bounds access on the last iteration of the loop (if the buffer is full), or accessing garbage data in the buffer.
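
One way to keep the lookahead inside the data that fread() actually returned is sketched below. The helper name `zero_pair_at` is hypothetical; `n` is assumed to be the fread() return value, not the buffer capacity.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: true if buf[i] and buf[i + 1] are both zero,
 * looking only at the n valid bytes in buf. The bounds test comes first,
 * so buf[i + 1] is never read when i is the last valid index. */
static int zero_pair_at(const char *buf, size_t n, size_t i)
{
    assert(buf != NULL);
    return i + 1 < n && buf[i] == '\0' && buf[i + 1] == '\0';
}
```

With this shape the last byte of a block is simply never treated as the start of a pair, which avoids both the out-of-bounds read and the stale-data read.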

**invalidCRC** · 09-24-2014

Thanks Salem! Good catch, I'll update my code block a bit later today.

**Salem** · 09-24-2014

More oddness.

> clearerr(outputFP);
> if (ferror(outputFP))
Calling clearerr() before ferror() makes no sense.

> if (fclose(inputFP) == -1)
You should compare with EOF, not -1
The standard just requires EOF to be negative.
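
A minimal sketch of the ordering these two remarks imply: look at the error state first, clear it only afterwards, and compare the fclose() result with EOF. The helper names `write_all` and `close_checked` are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: write n bytes and detect failure from the fwrite()
 * return value or ferror(); the flag is cleared only after it was seen. */
static int write_all(FILE *fp, const char *buf, size_t n)
{
    assert(fp != NULL && buf != NULL);
    if (fwrite(buf, 1, n, fp) != n || ferror(fp)) {
        perror("fwrite");
        clearerr(fp);
        return -1;
    }
    return 0;
}

/* Hypothetical helper: fclose() reports failure by returning EOF. */
static int close_checked(FILE *fp, const char *what)
{
    assert(fp != NULL);
    if (fclose(fp) == EOF) {
        perror(what);
        return -1;
    }
    return 0;
}
```

Checking the fwrite() return value is usually enough on its own; ferror() is included here only to mirror the calls discussed above.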

**invalidCRC** · 09-25-2014

Better?

Code:

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <fcntl.h>
#include <error.h>
#include <string.h>
#include <unistd.h>
#include <ctype.h>
#include <stdint.h>


#include <assert.h>


#define BUF_SIZE 8192		// 8k ;)


static off_t fsize(const char *filename);


/**
 * fsize(const char *filename)
 * 
 * @brief Returns the filesize of a file using stat.h
 * @param filename
 * @return -1 if there is an error, zero or positive value otherwise
 */
static off_t fsize(const char *filename)
{
	struct stat st = { 0 };


	if (stat(filename, &st) == 0) {


		return st.st_size;
	}


	return (-1);
}


/**
 * main(int argc, char **argv)
 * 
 * @brief A main function
 * @param argc
 * @param argv
 * @return -1 if error, 0 for success
 */
int main(int argc, char **argv)
{
	FILE *inputFP = NULL;
	FILE *outputFP = NULL;
	char buf[8192] = { };
	ssize_t file_size = 0;
	ssize_t apparent_size = 0;


	/* Open a file descriptor for the input file */
	if ((inputFP = fopen(argv[1], "r")) == NULL) {
		perror("Error opening input file descriptor");
		return (-1);
	}


	/* Determine the file_size of the input file */
	if ((apparent_size = fsize(argv[1])) < 0) {
		perror("Unable to determine accurate size of file");
		return (-1);
	}
	printf("file's apparent size:%i\n", (unsigned int)apparent_size);


	/* Open a file descriptor for the output file */
	if ((outputFP = fopen(argv[2], "w")) == NULL) {
		perror("Error opening output file descriptor");
		return (-1);
	}


	/* Lets advise the kernel that we are going to attempt a long
	 * sequential write */
	if (posix_fadvise(fileno(inputFP), 0, 0, POSIX_FADV_SEQUENTIAL) < 0) {
		perror("Unable to advise the kernel for the upcoming sequential write");
		// Continue anyways...
	}


	/* Create a file using ftruncate and lets carry on */
	if (ftruncate(fileno(outputFP), apparent_size) < 0) {
		perror("Unable to create a file that will have a similar size to the original");
		return (-1);
	}


	/* Read from the input file descriptor and write to the output file descriptor
	 * while keeping the 8k within the CPUs L1 cache */


	int i = 0;
	unsigned long holeSize = 0;
	while ((file_size = fread(buf, sizeof(char), sizeof(buf), inputFP)) > 0) {


		for (i = 0; i < file_size; i++) {


			if (buf[i] == '\0') {
				holeSize++;
			} else if (holeSize > 0) {
				fseek(outputFP, holeSize, SEEK_CUR);
				fwrite(&buf[i], 1, 1, outputFP);
				clearerr(outputFP);
				if (ferror(outputFP)) {
					perror("write\n");
					break;
				}
				holeSize = 0;
			} else {


				fwrite(&buf[i], sizeof(char), sizeof(char), outputFP);
				
				if (ferror(outputFP)) {
					perror("write\n");
					break;
				}
			}


		}


	}


	/* Close the input file descriptor */
	if (fclose(inputFP) < 0) {
		perror("closing inputFP");
		return (-1);
	}


	/* Close the output file descriptor */
	if (fclose(outputFP) < 0) {
		perror("closing outputFP");
		return (-1);
	}


	return 0;
}

**Salem** · 09-26-2014

You're still calling clearerr() before ferror() in one instance.

Code:

    /* Read from the input file descriptor and write to the output file descriptor
     * while keeping the 8k within the CPUs L1 cache */

Your reason for choosing your buffer size is bogus.
The whole of buf is written to by the fread() call.
Each element of buf is read once (if zero), or twice (if non-zero).
You have other data as well (the rest of the local stack frame for instance), not to mention the effects of
- calling other functions
- traps into the OS to physically read/write the disk
- the OS itself forcing context switches to other processes.

The elephants in the room are all those fread / fwrite / fseek calls.
Memory access times are measured in nanoseconds.
Hard disk head seek times are measured in milliseconds (that's 1M times slower).

Why bother worrying about whether something takes 1 or 2 seconds when you know there is a delay of a fortnight coming up real soon?

> char buf[8192]
You don't even use your #define value.
IMO, you would be better off to start with using BUFSIZ, which is a constant in stdio.h, and is the optimal size for file operations, as determined by the implementers of your standard C library.
char buf[BUFSIZ];

Then there is the 'time' command.
time ./copysparse infile outfile
Depending on your system, this should print out how much time (real time, user time and system time) the process spent performing the given task.

Also try using the 'top' command (in another terminal window). It shows you the current active processes.
You might be surprised by how little CPU time your copy program takes.
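
For comparison with the byte-at-a-time loop, here is a rough sketch of a block-wise copy that uses a BUFSIZ-sized buffer and skips blocks that are entirely zero. It is an assumption-laden illustration, not the poster's program: it uses read/write/lseek instead of stdio, and the names `all_zero` and `copy_sparse` are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>      /* BUFSIZ */
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* True if all n bytes of buf are zero. */
static int all_zero(const char *buf, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (buf[i] != '\0')
            return 0;
    return 1;
}

/* Hypothetical helper: copy src to dst one block at a time, seeking over
 * all-zero blocks instead of writing them. Returns 0 on success, -1 on
 * error. */
static int copy_sparse(const char *src, const char *dst)
{
    assert(src != NULL && dst != NULL);
    char buf[BUFSIZ];
    int rc = -1;

    int in = open(src, O_RDONLY);
    if (in < 0)
        return -1;
    int out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (out < 0) {
        close(in);
        return -1;
    }

    off_t total = 0;
    ssize_t n;
    while ((n = read(in, buf, sizeof buf)) > 0) {
        if (all_zero(buf, (size_t)n)) {
            /* Leave a hole: advance the output offset, write nothing. */
            if (lseek(out, n, SEEK_CUR) < 0)
                goto done;
        } else {
            ssize_t off = 0;
            while (off < n) {               /* write() may be short */
                ssize_t w = write(out, buf + off, (size_t)(n - off));
                if (w < 0)
                    goto done;
                off += w;
            }
        }
        total += n;
    }
    /* A trailing hole is only reflected in the size after ftruncate(). */
    if (n == 0 && ftruncate(out, total) == 0)
        rc = 0;

done:
    close(in);
    if (close(out) < 0)
        rc = -1;
    return rc;
}
```

This makes one read per block and at most one seek or write per block, so the number of library and system calls drops by roughly the buffer size compared with handling each byte separately. Whether the skipped blocks really end up as holes depends on the filesystem.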

**invalidCRC** · 09-26-2014

Hi Salem,

Excellent tips - I really appreciate it. Yes, I understand that I/O will be slow, especially without a RAM disk or an SSD. mmap could be an alternative too.

Should I also use something like posix_fallocate() to verify that there is actually enough space on disk to create the duplicate file (with the apparent size)?

Re: posix_fallocate(3) - Linux manual page
