Thoughts on a large problem

Printable View

09-13-2009
StainedBlue

Thoughts on a large problem

I am faced now with a large and open-ended task to allow users to upload Excel, XML, and even CSV files to the server, parse those files, perform validations, and ultimately move the data into the database.

As a constraint, I can not assume anything about how the user has organized their data. I should perform as much "intelligent" matching on the user's columns with the columns I expect. But where perfect matches aren't found, there should be a web page that allows the user to explicitly define what columns are what. This user preference should be stored in a XML file for future reuse.

A couple problems I face:

1)Excel files that are exported from the web or another program are not actually Excel files, (they're really just HTML tables with a .xls extension). Because of this, the Microsoft JET driver can't read them. I've looked into possible solutions, one being Microsoft Indexing Service. Apparently it is capable of discerning information about binary files, such as what the actual content is. Does anyone have any experience with this? If yes, is there any guidance or code-snippets you can share?

2)I've played around with a few different designs. Modularizing this project is obviously key to my sanity. One thing I would like to do is allow other developers to "plug-in" to the upload process with custom validation rules, and rules that define the structure of the data that is needed by the application. Like I said, I have several ideas that I've toyed around with, but I would love to hear some ideas from you. Sometimes an outside perspective is invaluable.
09-13-2009
Sebastiani

Well, I'm no Excel expert, but AFAIK all versions support importing of the CSV format. If so, you could make that de facto format for the data, and then just provide an interface where plug-ins that can parse different formats down to CSV can be added to the application.
09-13-2009
StainedBlue

Thank you for your suggestion.

While exporting to any format from the code-behind is certainly do-able, importing is not. Meaning, I can't take an uploaded file (in CSV or HTML) and convert it to Excel or vice versa.

Our process as of now is to have the users open up any Excel file they may have downloaded from another web site, and re-save it as an actual Excel file. This seems works for now, as our small group of users are competent enough with copmuters, but that might not be the case later on down the road as we take on more users.

Another oddity of this process is that even if the file is re-saved as an Excel doc, I always end up with some loss of table formatting, which, come parse-time, sees a few cells here and there as being empty, when they're clearly populated in the Excel file! Not a big deal now like I said, but certainly not acceptable going into the future of this project.

I have heard of ASPOSE.net which does all kinds of nifty things with Office products, however the developer seat costs are astronomical, and aren't going to happen for a project that's already shelled out the dough for 2 new servers, 2 licences of SQL Server enterprise and is now over a year behind schedule! (there were some serious problems with the previous developers)...
09-13-2009
Sebastiani

Quote:

While exporting to any format from the code-behind is certainly do-able, importing is not. Meaning, I can't take an uploaded file (in CSV or HTML) and convert it to Excel or vice versa.

Using Excel 2007, you can import CSV following the menu options Data->From Text->Get External Data->Text Import Wizard.

Another option is to convert everything to Office Open XML, which Excel should open without having to use an import wizard. I'm not really familiar with the format, personally, but you can check out the specs as defined by the ECMA-376 standard here. There are probably libraries out there that can make working with the format much easier, but chances are that they aren't free.

EDIT: It turns out that Microsoft actually has an OOXML SDK that might be useful.
09-13-2009
StainedBlue

>>Using Excel 2007, you can import CSV following the menu options Data->From Text->Get External Data->Text Import Wizard.

Right. But that can't be done programmatically, that's what I was trying to get at. No one is going to sit around opening up files as they're uploaded to the server. This must be done programmatically.

>>It turns out that Microsoft actually has an OOXML SDK that might be useful.

Nice find. I will definitely look into that to see what I can leverage.

Thank You.
09-13-2009
Sebastiani

Quote:

Right. But that can't be done programmatically, that's what I was trying to get at. No one is going to sit around opening up files as they're uploaded to the server. This must be done programmatically.

Got it. In that case then, any native Excel format would do (although I hear that their binary formats are poorly documented - you'd probably have better luck with something like OOXML).
09-13-2009
StainedBlue

Perhaps I should try to clarify the process a bit.

Initial)User has data file (could be Excel, XML, CSV) that they got from wherever.

1)User uploads file to server.

2)Our app performs some initial validation (checks file type).(This is the area we've been discussing, and needs a lot more intelligent file-handling)

3)The data is parsed into a DataSet object.

4)The business logic is introduced that will take the data from the DataSet, and move it into the database.

5)The uploaded file is deleted from the server.

Where I want to plug in to allow developer flexibility is in step 4. In step 4, a developer should be able to add rules (the business logic) to take the uploaded data (now in the DataSet) and move it to the Database. Because this should be a general, all-purpose process, I want to create a plug-able interface for other developers here.

This is the area that I've struggled with. What is the best way to allow developers to insert their own business logic needs here? Basically, a class should be created that has methods like, AddColumnRule, or AddValidationRule, etc, etc. It quickly becomes a high order of complexity as I enumerate the possibilities. Really, I need suggestions here.

EDIT: As an aside, the previous developers flat-out refused to implement this feature. Their concerns were over data integrity, but now I'm beginning to think it was the order of complexity!
09-13-2009
Sebastiani

Maybe I'm just confused here, but it seems that the plug-in logic should be done during step 3. That is, I'm assuming the DataSet object is some sort of factory for producing objects of type Date, Integer, String, etc - correct? Unless you're talking about some higher-level constraints, in which case it would be helpful to see an example of what these might be.
09-13-2009
StainedBlue

>>I'm assuming the DataSet object is some sort of factory for producing objects of type Date, Integer, String, etc - correct?

No. A DataSet is a built in .NET object that is really an in-memory table (or tables) of data. All step 3 does is gives me the collection of data in-memory to manipulate and introduce any business logic.

It works like this:

Code:

OleDbConnection excelConnection = new OleDbConnection( "Provider=Microsoft.Jet.Oledb.4.0;" + "Data Source=" + Server.MapPath(fullPath) + ";" + // path to location where I saved the file "Extended Properties=\"Excel 8.0;HDR=Yes;\""); OleDbCommand cmd = new OleDbCommand( @"SELECT * FROM [NameOfSpreadsheet$]", excelConnection); OleDbDataAdapter adapter = new OleDbDataAdapter(); adapter.SelectCommand = cmd; DataSet data = new DataSet(); adapter.Fill(data);

Where "data" is now an in-memory table that can be accessed like:

Code:

foreach(DataRow dr in data.Tables[0].Rows) { // Business logic would go here...

Of course, this is how Excel spreadsheets are handled. CSVs and XML files would be naturally handled differently. But putting them into a DataSet object as well would be desirable, so the business logic plug in would always be introduced in the same portion of the code
09-13-2009
StainedBlue

Here is kinda what I want for a plug-in:

Code:

DataRules rules = new DataRules(); // This class would be the plug-in I create rules .AddColumn("Name", typeof(string)); // Where "Name" is the column name containing string values .AddColum(etc, etc rules .AddValidationRule(Regex.someExpression, onColumnName)

Where the entire business-logic could be dictated in this fashion. I just need some guidance because I'm only so familiar with C# concepts like delegates and extension methods, and I'm not really sure what's the best way to leverage off these powerful features. C# has some very powerful features here indeed, yet it takes some time to wrap your head around some of them if you're relatively new to the language.

And perhaps there's an all together better way to go about this that's completely eluding me.
09-13-2009
StainedBlue

This means that a User-Control would be created to allow the basic file-uploading.

The User-Control would have a private field "DataRules" class if that's what I name it.

A developer would add rules via extension methods (for their ease of coding).

The process of moving data into the database from the DataSet (gratuitous use of the word "data" here!) would be dictated by this DataRules object.

That's where I'm at. Again, there may be another way or pattern that I don't know about.

What's really confusing to me, because I've never done anything like this, is I'm trying to allow other programmers to essentially write their own routine to move data to the database. At some abstract level it's like designing my own language. So somehow, the rules must me stored in a list or something because a dev. could have one rule or 100 rules! I can't assume anything here. I'm sure some of the experts here have dealt with a similar problem at some point.

Of course C# has delegates which would allow another developer to just write another function, so a DataRules class wouldn't even have to be designed in the first place. That's well and good, but other developers aren't working on this particular feature, and therefore would have to learn, as I am, about this whole process. So it's a problem of my coding ease versus theirs. I'm leaning toward their ease.
09-13-2009
Sebastiani

You'll have to forgive me for being a little distracted here (multitasking and such), but so I'm assuming that normally you'd just populate the database directly from the DataSet object as strings (had you known that it was valid in the first place, of course), so you really don't need the overhead of some sort of table class, just a way to validate it before it gets to that step, correct? If so, you could really keep it simple by using an array/container of 'row' validators. The size of the container dictates the number of columns required per row, and each element in the validator corresponds to a column of the same index. Then all you need to do is write a validators for each new data-type, so that when you want to define a new table validation scheme, you just create a container of validators to define the rules, so to speak, and instantiate a row validation object and associate it with the 'ruleset'. A stripped down example might look something like:

Code:

using System; using System.Data; using System.Data.OleDb; abstract class DataElementValidator { public abstract bool validate( String input ); } class IntegerDataElementValidator : DataElementValidator { public override bool validate( String input ) { try { int.Parse( input ); } catch( FormatException ) { return false; } return true; } } class FloatDataElementValidator : DataElementValidator { public override bool validate( String input ) { try { float.Parse( input ); } catch( FormatException ) { return false; } return true; } } // ...etc... class DataSetValidator { public DataSetValidator( DataElementValidator [ ] validators ) { set_validators( validators ); } public DataSetValidator( ) : this( null ) { } public bool validate( DataSet data_set ) { if( validators == null ) return false; foreach( DataRow row in data_set.Tables[ 0 ].Rows ) { object [ ] items = row.ItemArray; if( items.Length != validators.Length ) return false; int index = 0; foreach( object data in items ) if( !validators[ index ].validate( ( String )data ) ) return false; } return true; } public void set_validators( DataElementValidator [ ] validators ) { this.validators = validators; } protected DataElementValidator [ ] validators; } class Example { static void Main( ) { DataElementValidator [ ] validators = new DataElementValidator [ ] { new FloatDataElementValidator( ), new FloatDataElementValidator( ), new IntegerDataElementValidator( ) /* ...etc... */ }; DataSetValidator data_set_validator = new DataSetValidator( validators ); /* Connect to database and validate dataset... */ } }

Not a very advanced solution, of course, but probably sufficient for the task.
09-13-2009
StainedBlue

Wouldn't

Code:

Convert.ToInt32( row["SomeColumn"] ); // or Convert.ToDouble( row["AnotherColumn"] );

do the same thing?

Actually the kind of validation rules I need deal more with Regular Expressions, like stripping hyphens out of certain strings.

>>normally you'd just populate the database directly from the DataSet object as strings

I wish it were that simple. In reality, multiple queries will be used to come up with what actually is being inserted/updated to the DB, which leads me to believe my only option is to allow other developers to wire-up their routines via delegates.

EDIT: Your validation scheme could easily accomodate my needs by adding the Regular Expression. Very nice solution. Thank You.
09-13-2009
C_ntua

I am truly not an expert in C#. I am not sure I undestand exactly what you want. I will give it a try, but might have taken what you want completely wrong.

Lets say you want to store the square root of the integers. That would be a rule. Someone might also want to add only the first 10 rows. That would be another rule. Am I getting this correct?
In other words, lets say you have table A. You apply some rules and you get table B. Table B is stored in the database.

If that is the case then you just have to define some general "group of rules". Like:
a. Rules that change the data. Basically a function that makes calculations taking input the initial data and produces a result, which will be the new data.
b. Rules that decide which data is valid. Like only positive integers, not negatives.
c. Rules that change the table itself. Like, only the first 10 rows. Or add new rows with this and that data.
d. Other rules

Non-validated data will be excluded. Or there will be an error message. Or a default value. You will choose or the user can choose what to do in their validation rules.

So in order to implement rules, you just create the appropriate functions.
You can have a DataRule object, which has some delegates. One for every group of rules (as noted above). Each group will have certain output and certain input. So the user will create functions and then you will add them to the appropriate delegate

Then, the DataRule object will execute all rules with one specified order. You will have something like this:

Code:

DataRule r = new DataRule(DataSet input, DataSet output); r.addCalcRule(userFunction); //blah blah blah r.exec(); //execute all delegates

Now, the functions will be up to the user to create.

You can make it more simple for the user that wants to implement rules if you make it easy for the user to manipulate data. Like a function that returns a boolean table which is the result of a logical operation on the whole table. And stuff like that.

I don't believe it would be too much a fuzz for the user to create such functions if they knows basic C programming.

So the user will create the functions, upload them with the data, you will create a proper code file using the functions, compile it and execute the program created with input the data. And the output will be your new data stored in the database.

To cover all the rules available you need to use a language in the end. So the user will have to make some kind of code to create the rules.
09-13-2009
StainedBlue

C_ntua

Yes, you've got it pretty much right. That's pretty much what I'm going for, enumerate possible rules/validations allowing other developers to wire them up with delegates, yes.

However, the users who will upload the data are not the one's writing any code. There's a separation there (though it makes no difference for my part).

Users uploading data are assumed to be completely dumb, and not programmers.

Other developers might have to implement this uploading scheme, though they won't be uploading any real data themselves.

>>I don't believe it would be too much a fuzz for the user to create such functions if they knows basic C programming.

Lol. Maybe you haven't done a whole lot of web apps for windows, but only about half of the developers have ever written a single line of c-style code. The other half, are VB programmers...