Thread: From word 2003 to xml

  1. #1
    Registered User
    Join Date
    Feb 2011
    Posts
    3

    From word 2003 to xml

    Hi there!

    I want to build an application that reads Word files and generates XML files with data from the .doc file.

    e.g.

    .doc file:

    Metallica (Ride the Lightning) - 1985, Metal

    xml:

    Code:
    <cd>
     <band>Metallica</band>
     <album>Ride the Lightning</album>
     <year>1985</year>
     <genre>Metal</genre>
    </cd>
    I assume I need COM, but I can't figure out advanced searching like "find all bold text that comes before a bold left parenthesis and ignore whitespace" (i.e. use the "(" to separate band from album). I'm thinking of COM in combination with regex, but I don't know where to start...

    Any thoughts/tuts?

  2. #2
    Gawking at stupidity
    Join Date
    Jul 2004
    Location
    Oregon, USA
    Posts
    3,218
    You had the right idea. I came up with this:
    Code:
    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text;
    using System.Text.RegularExpressions;
    using Microsoft.Office.Interop.Word;
    
    namespace WordReader
    {
        class Program
        {
            static void Main(string[] args)
            {
                // Start Word through COM interop and open the document read-only.
                Application app = new Application();
                string path = Environment.CurrentDirectory + @"\..\Albums.docx";
                Document doc = app.Documents.Open(path, ReadOnly: true);
    
                // Expected line format: Artist (Title) - Year, Genre
                Regex pattern = new Regex(@"^(?<artist>.+) \((?<title>.+)\) - (?<year>.+),(?<genre>.+)");
                List<Album> albums = new List<Album>();
                foreach(Paragraph paragraph in doc.Paragraphs)
                {
                    // Range.Text is the plain text of the paragraph, so formatting
                    // such as bold is ignored. Empty paragraphs just fail to match.
                    Match match = pattern.Match(paragraph.Range.Text);
                    if (match.Success)
                    {
                        // Trim() removes the space after the comma and the trailing
                        // paragraph mark that the genre group would otherwise capture.
                        Album album = new Album();
                        album.Artist = match.Groups["artist"].Value.Trim();
                        album.Title = match.Groups["title"].Value.Trim();
                        album.Year = match.Groups["year"].Value.Trim();
                        album.Genre = match.Groups["genre"].Value.Trim();
                        albums.Add(album);
                    }
                }
    
                // Close the document and shut Word down so no hidden instance is left running.
                app.Documents.Close();
                ((_Application)app).Quit();
    
                foreach (Album album in albums)
                {
                    Console.WriteLine("Artist: {0}", album.Artist);
                    Console.WriteLine("Title: {0}", album.Title);
                    Console.WriteLine("Year: {0}", album.Year);
                    Console.WriteLine("Genre: {0}", album.Genre);
                    Console.WriteLine();
                }
            }
    
            struct Album
            {
                public string Artist;
                public string Title;
                public string Year;
                public string Genre;
            }
        }
    }
    Output:
    Code:
    Artist: Metallica
    Title: Ride the Lightning
    Year: 1985
    Genre: Metal
    
    Artist: Metallica
    Title: Master of Puppets
    Year: 1986
    Genre: Metal
    Last edited by itsme86; 02-17-2011 at 11:34 AM.
    If you understand what you're doing, you're not learning anything.

  3. #3
    Registered User
    Join Date
    Feb 2011
    Posts
    3
    This is very helpful, thanks! However the objective is to generate xml files with that data. I also thought of using MS Word wildcards as a possible alternative to regex, but I think they are not as expressive as regex... I'll give it a shot and report back.

    By the way, I think struct is not available in C#, is it?
    Last edited by Lemmyz; 02-18-2011 at 02:47 AM.

  4. #4
    Gawking at stupidity
    Join Date
    Jul 2004
    Location
    Oregon, USA
    Posts
    3,218
    Quote Originally Posted by Lemmyz
    However the objective is to generate xml files with that data.
    Well, it sounded like the part you were having problems with was reading/parsing the content in the .doc file. Generating the XML once you have the fields is trivial.
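
    As a rough illustration of the XML step, here is a minimal sketch using LINQ to XML. It assumes the Album struct from the earlier post and the element names from the first post (cd, band, album, year, genre); the AlbumXml class and ToCdElement method are hypothetical names made up for this example.

```csharp
using System;
using System.Xml.Linq;

// Mirrors the Album struct from the earlier post.
public struct Album
{
    public string Artist;
    public string Title;
    public string Year;
    public string Genre;
}

public static class AlbumXml
{
    // Hypothetical helper: builds the <cd> element shown in the first post.
    // XElement escapes special characters such as '&' in the values.
    public static XElement ToCdElement(Album album)
    {
        return new XElement("cd",
            new XElement("band", album.Artist),
            new XElement("album", album.Title),
            new XElement("year", album.Year),
            new XElement("genre", album.Genre));
    }
}
```

    Calling ToString() on the returned element gives the XML as text, and wrapping it in an XDocument and calling Save(path) would write one file per album. Building the elements this way instead of concatenating strings means a title containing "&" still produces well-formed XML.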

    Quote Originally Posted by Lemmyz
    By the way, I think struct is not available in C#, is it?
    Of course it is.
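
    For anyone unsure how a C# struct differs from a class, a small sketch: a struct is a value type, so assigning it makes a copy. The StructDemo class and TitleAfterCopyEdit method are hypothetical names used only for this illustration.

```csharp
using System;

public struct Album
{
    public string Artist;
    public string Title;
}

public static class StructDemo
{
    // Edits a copy of a struct and returns the title of the original.
    public static string TitleAfterCopyEdit()
    {
        Album original = new Album();
        original.Artist = "Metallica";
        original.Title = "Ride the Lightning";

        Album copy = original;            // structs are copied on assignment
        copy.Title = "Master of Puppets"; // changes the copy only

        return original.Title;            // still "Ride the Lightning"
    }
}
```

    Had Album been declared as a class, copy and original would refer to the same object and the original title would have changed too.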

  5. #5
    Registered User
    Join Date
    Feb 2011
    Posts
    3
    Yep it is... :P...

    Actually, I'm not satisfied with my approach so far... I've been having trouble getting the content of each paragraph of a document (formatting included), storing it in a custom object, and then running a regex or wildcard search on each object...

    e.g. in .doc file:

    Metallica (Ride the Lightning) - 1985, Metal¶

    Megadeth (Rust in Peace) - 1990, Metal¶

    Ethereal Blue (Essays in Rhyme on passion & Ethics) - 2010, Prog Death Metal¶


    So 3 paragraphs would make 3 Paragraph objects, from which I should extract the XML data and generate 3 different XML files.
    But it looks harder than I thought...
    Another thing is that this document has 5 paragraphs (that's what Paragraphs.Count gives), while WdStatistic.wdStatisticParagraphs returns 3 (the ones with text in them, which is what I need...)
    Last edited by Lemmyz; 02-21-2011 at 03:24 PM.

  6. #6
    Gawking at stupidity
    Join Date
    Jul 2004
    Location
    Oregon, USA
    Posts
    3,218
    Why are you not using the code I provided? The boldness of the font doesn't matter using the regex pattern I supplied. Doesn't it read the Word doc correctly? Obviously you should change the last foreach() to render XML instead of just output to the console, but the Album list creation should work fine.
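
    On the 5 versus 3 paragraph count: a sketch of why the extra paragraphs should not matter with the regex approach. It works on plain strings so it runs without Word, and it assumes each string is what paragraph.Range.Text would return, i.e. the paragraph text followed by a paragraph mark ("\r"). ParagraphFilter and AlbumLines are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ParagraphFilter
{
    // Same pattern as in the earlier post: Artist (Title) - Year, Genre
    static readonly Regex Pattern = new Regex(
        @"^(?<artist>.+) \((?<title>.+)\) - (?<year>.+),(?<genre>.+)");

    // Hypothetical helper: keeps only the paragraph texts that look like album lines.
    public static List<string> AlbumLines(IEnumerable<string> paragraphTexts)
    {
        List<string> lines = new List<string>();
        foreach (string text in paragraphTexts)
        {
            string trimmed = text.Trim();   // drops the trailing paragraph mark
            if (Pattern.IsMatch(trimmed))   // empty paragraphs never match
                lines.Add(trimmed);
        }
        return lines;
    }
}
```

    Fed the three album lines from the previous post plus two empty paragraphs, AlbumLines returns three entries, so there is no need to count or pre-filter paragraphs separately.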
