Thread: Compressing XML data using LINQ and GZipStream

  1. #1
    Registered User
    Join Date
    Jan 2009
    Posts
    2

    Hello, I am attempting to write a small utility for switching WoW user interfaces, and because the vast majority of the data is text, I decided to store all of it in XML files. I wrote these small functions. (Note that this is not the final code that will go into my program; it contains some quick fixes that should do the same thing, e.g. the 'kindalocal' variable.)

    Create an XML tree for a folder:
    Code:
            private static XElement CreateXMLTreeForFolder(string folder)
            {
                string[] dirs = Directory.GetDirectories(folder, "*", SearchOption.TopDirectoryOnly);
                string[] files = Directory.GetFiles(folder, "*", SearchOption.TopDirectoryOnly);
    
                string kindalocal = folder.TrimEnd('\\');
                int indexs = kindalocal.LastIndexOf('\\');
                kindalocal = kindalocal.Substring(indexs + 1);
    #if DEBUG
                Console.WriteLine("Writing folder {0}...", kindalocal);
    #endif
                string localname = XmlConvert.EncodeName(kindalocal);
    
                XElement[] xfiles = new XElement[files.Length];
                XElement[] xdirs = new XElement[dirs.Length];
                for (int i = 0; i < files.Length; i++)
                {
                    xfiles[i] = CreateXMLTreeForFile(files[i]);
                }
                for (int i = 0; i < dirs.Length; i++)
                {
                    xdirs[i] = CreateXMLTreeForFolder(dirs[i]);
    
                }
                XElement fr = new XElement(localname);
                for (int i = 0; i < xdirs.Length; i++)
                {
                    fr.Add(xdirs[i]);
                }
                for (int i = 0; i < xfiles.Length; i++)
                {
                    fr.Add(xfiles[i]);
                }
                return fr;
            }
    Create an XML tree for a File:
    Code:
            private static XElement CreateXMLTreeForFile(string file)
            {
                string localname = XmlConvert.EncodeName(file.Substring(file.LastIndexOf('\\') + 1));
    #if DEBUG
                Console.WriteLine("Writing file {0}...", localname);
    #endif
                return new XElement(localname, XmlConvert.EncodeName(ASCIIEncoding.ASCII.GetString(File.ReadAllBytes(file))));
            }
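    A possible alternative for the file contents, sketched under assumptions rather than taken from the original program: XmlConvert.EncodeName is intended for element and attribute names, and it escapes every character that is not valid in a name as _xHHHH_, so running it over whole file contents inflates them considerably. Decoding the bytes as ASCII also turns every byte above 0x7F into '?'. Storing the content as Base64 text keeps the bytes intact at roughly 4/3 of the original size. The names FileElementSketch, CreateElementForFile and ReadElementBytes are hypothetical.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Linq;

static class FileElementSketch
{
    // Hypothetical variant of CreateXMLTreeForFile: the element name is still
    // passed through EncodeName, but the content is stored as Base64 text.
    public static XElement CreateElementForFile(string file)
    {
        string localname = XmlConvert.EncodeName(Path.GetFileName(file));
        byte[] content = File.ReadAllBytes(file);
        return new XElement(localname, Convert.ToBase64String(content));
    }

    // Reverses the step above: decode the element's text back into raw bytes.
    public static byte[] ReadElementBytes(XElement element)
    {
        return Convert.FromBase64String(element.Value);
    }
}
```

    Base64 text compresses less well than plain text, so whether this is a net win after GZip is something to measure rather than assume.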
    When written directly as an XML file (using XDocument.Save(string)), I can open it in IE and view everything. But this resulted in very large files (about 40MB for 5MB of data), so I decided to compress it using the following functions.

    Code:
    private static void CompressXML(XDocument element, string filename)
            {
                MemoryStream mstream = new MemoryStream();
                XmlWriter xr = XmlWriter.Create(mstream);
                element.Save(xr);
                CompressFile(mstream.ToArray(), filename);
            }
    
            private static void CompressFile(byte[] file, string filename)
            {
                FileStream fs = new FileStream(filename, FileMode.Create);
                GZipStream gs = new GZipStream(fs, CompressionMode.Compress);
                gs.Write(file, 0, file.Length);
                fs.Close();
            }
    This results in a file about 1MB. But when I try to read it using:
    Code:
            private static XDocument DecompressXML(string filename)
            {
                DecompressFile(filename);
                XElement el = XElement.Load(filename + ".tmp");
                XDocument doc = new XDocument();
                doc.Add(el);
                return doc;
            }
    
            private static void DecompressFile(string filename)
            {
                FileStream fs = new FileStream(filename, FileMode.Open);
                GZipStream gs = new GZipStream(fs, CompressionMode.Decompress);
                int length = ReadAllBytesFromStream(gs);
                gs.Close();
                fs.Close();
                fs = new FileStream(filename, FileMode.Open);
                gs = new GZipStream(fs, CompressionMode.Decompress);
                byte[] data = new byte[length];
                gs.Read(data, 0, length);
                fs = new FileStream(filename + ".tmp", FileMode.OpenOrCreate);
                fs.Write(data, 0, data.Length);
                fs.Close();
            }
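    Two details in the functions above may be worth checking: a single Read call is not guaranteed to fill the whole buffer, and FileMode.OpenOrCreate does not truncate an existing .tmp file, so stale bytes from an earlier, longer run can remain at the end. A minimal sketch that avoids both by skipping the temporary file and letting the XML reader pull from the GZipStream directly is below. It assumes the file was written by the matching Save method, and CompressedXmlSketch is a hypothetical name.

```csharp
using System.IO;
using System.IO.Compression;
using System.Text;
using System.Xml.Linq;

static class CompressedXmlSketch
{
    public static void Save(XDocument doc, string filename)
    {
        // The using blocks dispose in reverse order: the writer is flushed first,
        // then the GZipStream writes its final block, then the file is closed.
        using (FileStream fs = new FileStream(filename, FileMode.Create))
        using (GZipStream gz = new GZipStream(fs, CompressionMode.Compress))
        using (StreamWriter writer = new StreamWriter(gz, new UTF8Encoding(false)))
        {
            doc.Save(writer);
        }
    }

    public static XDocument Load(string filename)
    {
        // StreamReader decodes UTF-8 and copes with a byte order mark if present.
        using (FileStream fs = new FileStream(filename, FileMode.Open))
        using (GZipStream gz = new GZipStream(fs, CompressionMode.Decompress))
        using (StreamReader reader = new StreamReader(gz))
        {
            return XDocument.Load(reader);
        }
    }
}
```

    Usage would be along the lines of CompressedXmlSketch.Save(doc, "ui.xml.gz") followed by CompressedXmlSketch.Load("ui.xml.gz"); the file name is only an example.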
    It fails on
    Code:
    XElement el = XElement.Load(filename + ".tmp");
    with an invalid character.
    Originally I had written it to not write the file back to the hard drive to see if the problem was how I was reading the file, but that appears not to be the case.

    tl;dr: Halp plox!

    Edit
    Adding
    Code:
    string test = UnicodeEncoding.UTF8.GetString(data, 0, 1000);
    results in
    Code:
    "<?xml version=\"1.0\" encoding=\"utf-8\" standalone=\"yes\"?><WTF><Accou"
    You can't see the first character on this forum, but the first byte in the array is 0xEF. I'm not sure what this character is, or why it is there, but it is messing with my stuff.

    Edit 2
    So I'm pretty sure it has something to do with decompression. I wrote
    Code:
    public byte[] Compress(byte[] data)
            {
                MemoryStream ms = new MemoryStream();
                GZipStream gs = new GZipStream(ms, CompressionMode.Compress);
                gs.Write(data, 0, data.Length);
                return ms.ToArray();
            }
    
            public byte[] Decompress(byte[] data)
            {
                MemoryStream ms = new MemoryStream(data);
                GZipStream gs = new GZipStream(ms, CompressionMode.Decompress);
                byte[] decompressed = new byte[ReadAllBytesFromStream(gs)];
                ms = new MemoryStream(data);
                gs = new GZipStream(ms, CompressionMode.Decompress);
                gs.Read(decompressed, 0, decompressed.Length);
                return decompressed;
                
            }
    but for some reason 'decompressed' is the same as 'data'.

    From what I gather from the example code, you wrap a GZipStream around any other stream. When you call GZipStream.Read(byte[], int, int), it puts the decompressed data from the underlying stream into the first parameter. *Thoroughly confused*
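    For what it is worth, Stream.Read is allowed to return fewer bytes than were asked for, so one Read call into a pre-sized array may leave part of it unfilled. A minimal sketch of the usual pattern, reading until Read returns 0 and collecting the bytes in one pass (so the data does not have to be decompressed twice just to learn its length), is below. StreamSketch and ReadToEnd are hypothetical names.

```csharp
using System.IO;

static class StreamSketch
{
    // Reads from any stream until Read returns 0, which signals end of stream.
    public static byte[] ReadToEnd(Stream source)
    {
        using (MemoryStream result = new MemoryStream())
        {
            byte[] buffer = new byte[4096];
            int read;
            while ((read = source.Read(buffer, 0, buffer.Length)) > 0)
            {
                // Only the first 'read' bytes of the buffer are valid this round.
                result.Write(buffer, 0, read);
            }
            return result.ToArray();
        }
    }
}
```

    With a decompressing stream this would be called as StreamSketch.ReadToEnd(new GZipStream(ms, CompressionMode.Decompress)).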
    Last edited by Aeixious; 01-16-2009 at 03:47 PM. Reason: More information

  2. #2
    Magos
    Join Date
    Sep 2001
    Location
    Sweden
    Posts
    3,145
    Look up the BOM (byte order mark). In your case it sounds like the UTF-8 one: 0xEF is the first byte of the UTF-8 BOM (0xEF 0xBB 0xBF).
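    As a rough illustration (not part of the original code; BomSketch and its methods are hypothetical names), a check for the UTF-8 byte order mark could look like the sketch below. Encoding.UTF8.GetString does not strip the mark, so it ends up as an invisible U+FEFF character at the start of the decoded string, which would fit a first character that cannot be seen.

```csharp
using System;
using System.Text;

static class BomSketch
{
    // True when the buffer begins with the UTF-8 preamble EF BB BF.
    public static bool StartsWithUtf8Bom(byte[] data)
    {
        byte[] bom = Encoding.UTF8.GetPreamble();
        if (data.Length < bom.Length) return false;
        for (int i = 0; i < bom.Length; i++)
        {
            if (data[i] != bom[i]) return false;
        }
        return true;
    }

    // Decodes UTF-8 text, skipping the three BOM bytes if they are present.
    public static string DecodeUtf8(byte[] data)
    {
        int offset = StartsWithUtf8Bom(data) ? 3 : 0;
        return Encoding.UTF8.GetString(data, offset, data.Length - offset);
    }
}
```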

  3. #3
    Registered User
    Join Date
    Jan 2009
    Posts
    2
    No, it has nothing to do with the byte order mark.

    I wrote the following as a test. It simply takes a file, reads it, compresses it, decompresses it, then compares the data. It fails on EVERY file. The lengths are off, usually by one or two bytes, and even when they match, the data differs toward the end.

    Compression.cs
    Code:
    namespace Test
    {
        class Compression
        {
            public Compression()
            {
            }
    
            public static byte[] Compress(byte[] data)
            {
                MemoryStream ms = new MemoryStream();
                DeflateStream gz = new DeflateStream(ms, CompressionMode.Compress);
                gz.Write(data, 0, data.Length);
                return ms.ToArray();
            }
    
            public static byte[] Decompress(byte[] data)
            {
                MemoryStream ms = new MemoryStream(data);
                DeflateStream gz = new DeflateStream(ms, CompressionMode.Decompress);
                int length = StreamLength(gz);
                ms = new MemoryStream(data);
                gz = new DeflateStream(ms, CompressionMode.Decompress);
                byte[] dat = new byte[length];
                gz.Read(dat, 0, length);
                return dat;
            }
    
            private static int StreamLength(DeflateStream stream)
            {
                int i = 0;
                byte[] temp = new byte[1000];
                while (true)
                {
                    int read = stream.Read(temp, 0, 1000);
                    i += read;
                    if (read == 0)
                        return i;
                }
            }
        }
    }
    Program.cs
    Code:
    namespace Test
    {
        class Program
        {
            static void Main(string[] args)
            {
                Console.WriteLine("Reading File...");
                byte[] file = File.ReadAllBytes(@"file"); //TODO: Replace me with a file you have!
                Console.WriteLine("Compressing...");
                byte[] compressed = Compression.Compress(file);
                Console.WriteLine("Decompressing...");
                byte[] decompressed = Compression.Decompress(compressed);
                Console.WriteLine("Complete! Calculating...");
                if (file.Length != decompressed.Length)
                {
                    Console.WriteLine("This is broke!");
                    Console.WriteLine("File Length: {0}", file.Length);
                    Console.WriteLine("Decompressed Length: {0}", decompressed.Length);
                    Console.ReadLine();
                    return;
                }
                int incorrect_counter = 0;
                for (int i = 0; i < file.Length; i++)
                {
                    if (file[i] != decompressed[i])
                    {
                        incorrect_counter++;
                    }
                }
                if (incorrect_counter > 0)
                {
                    Console.WriteLine("This is broke!");
                    Console.WriteLine("{0} bytes are messed up.", incorrect_counter);
                    Console.ReadLine();
                    return;
                }
                Console.WriteLine("Yeah!");
                Console.ReadLine();
            }
        }
    }
    I omitted the usings, but you need
    Code:
    using System;
    using System.IO;
    using System.IO.Compression;
    Output on a random 9MB file...
    Code:
    Reading File...
    Compressing...
    Decompressing...
    Complete! Calculating...
    This is broke!
    File Length: 10087472
    Decompressed Length: 10087471
    I'm usually not one to assume there is a bug... but this is about as simple as I can get it before manually compressing my computer into a wall. The sad thing is the sample code in MSDN works perfectly. I'm so very confused.

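    Edit: Never mind that I posted here. A gz.Close() after gz.Write in Compress was all I needed, since the compressing stream only writes out its last buffered bytes when it is closed. Please never speak of this again.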
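    For anyone landing here later, a minimal sketch of the round trip with the streams closed properly is below. It is one way to write it rather than the exact code I ended up with; DeflateSketch is a hypothetical name, and Stream.CopyTo assumes .NET 4 or later.

```csharp
using System.IO;
using System.IO.Compression;

static class DeflateSketch
{
    public static byte[] Compress(byte[] data)
    {
        using (MemoryStream ms = new MemoryStream())
        {
            using (DeflateStream dz = new DeflateStream(ms, CompressionMode.Compress))
            {
                dz.Write(data, 0, data.Length);
            } // Dispose (same effect as Close) writes the final buffered block into ms.

            // ToArray still works after the MemoryStream has been closed.
            return ms.ToArray();
        }
    }

    public static byte[] Decompress(byte[] data)
    {
        using (MemoryStream input = new MemoryStream(data))
        using (DeflateStream dz = new DeflateStream(input, CompressionMode.Decompress))
        using (MemoryStream output = new MemoryStream())
        {
            // Copies until the decompressor reports end of stream.
            dz.CopyTo(output);
            return output.ToArray();
        }
    }
}
```

    Swapping DeflateStream for GZipStream should work the same way, since both buffer output until they are closed.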
    Last edited by Aeixious; 01-22-2009 at 11:25 PM. Reason: IM A MORON
