Help with SQL
Clarification: I don't know what's up with me and thread titles lately. Can't seem to do it right. I don't need help with the actual SQL. I need help with database design.
I need some help coming with a good database design for the following problem:
- Each article in a store has several properties (be it size, color, dimensions, type, etc...)
- Article properties can be integers, floats, text or enums
- Properties aren't known in advance. That is, they can't be part of the DDL, they are actual data. Properties are meant to be created on demand, if required, as new articles make it into the store.
- Properties may not always be reusable. That is, while an article "size" property can be defined as an enum (a shirt can be small, medium, large), another article can have a "size" property defined as an integer (shoes).
This defines the first part of the problem. A good database design that allows me to implement and maintain variable properties.
The second part of the problem extends the former problem by introducing "categories". You can ignore the rest of this post, by providing an answer to the first part only.
Products are supposed to also be categorized. A category simply classifies the product within a hierarchy. So, an article like Rebook Princess Shoes may be classified as "Shoes"->"Female"->"Sneakers", for instance. The same problem exists, though. All categorization and relationships cannot be defined as part of the DDL since they are actual variable data that is bound to change as the store introduces new articles into its stock, removes others or decides to reclassify existing ones.
Tempting solution to the second part:
Once I find a solution to the first part, I'm tempted to look at categories as just another property and design the database that way. That is, I simply incorporate into the first solution a parent_id field and the field that describes a property data type can be left null or be defined as "category". Root records are considered properties, all others are categories, while the root category references itself.
Could you model something around XML and DTD's?
The latest namespace-aware version of XML (I think) allows you to build hierarchies. So for example a top-level of clothes could be then sub-divided into shirts and shoes (each with their own unique view of 'size').
DTD's give you validation, ensuring that the structure of an actual XML representation of your data is consistent with your description of what you expect it to contain.
Hmmm the first problem is a tricky one but I think something along the lines of this can be a solution, not sure how good of a solution it is though:
You create another table that has 2 foreign keys, 1 into the articles table, one into the properties table. This will be your container for properties for each article
Now since you want to have variable property types you have to decide which way to go here, either you make a big table with many columns, 1 for each type you want to support as property, you could go a polymorphic route but this would require to update the application as you update OR make the "base" instance of the property aware of in what table the data can be found or you can make the data a string only and parse it in the program.
To expand no what i mean with the polymorphic route:
you have a "base" table Property, this table holds information about all properties, such as name and whatnot. You then create several tables such as EnumProperty, FloatProperty and so on. These all foreign reference the base table and only contain the data. Now what i meant with the base instance being aware of the table the data can be found in was that property has a column "DataTable" that simply specifies which table the data can be found in (EnumProperty, FloatProperty and so on).
This should give you something along the lines you want but not sure if it is the best solution.
>> Could you model something around XML and DTD's?
The data model for this system (in fact, eventually and hopefully, a series of systems to form a complete Retail Management Solution) will be complex and invariably there will be a desire to adopt OpenXML, and SQLXML in the DAL. Any hierarchic information is a good candidate no doubt, depending on its usage. However data still has to be stored relationally on the database and here's where I have my doubts on how best to do this.
I'm not entirely discarding the possibility of storing XML on the database, mind you. I still need to learn about the xml data type. I'm a long-time relational guy, so that's probably a habit I need to kill. But of the top of my head the xml data type would, for instance, immediately solve my problem with variable data types. I'll definitely need to study your suggestion.
>> Now since you want to have variable property types you have to decide which way to go here, either you make a big table with many columns, 1 for each type you want to support as property, you could go a polymorphic route but this would require to update the application as you update OR make the "base" instance of the property aware of in what table the data can be found or you can make the data a string only and parse it in the program.
Decoupling is the name of the game here. Anything that closely marries client and server is a big no-no for us. We would rather break normalization than closely couple modules or APIs.
I'm hoping more for a proven solution though. Something someone had achieved already with satisfactory results.
This is a perfect example of the limitations of relational designs. What you need is a something that can model transitive hierarchies (If I'm in digital cameras, then I'm in cameras, and if I'm in cameras than I'm in electronics, and if I'm in electronics than I'm in Products, etc...) and self-describing schemas. What you basically want is RDF, or some variation of it.
The transitive hierarchies give you the ability to process queries like "Find me electronics" and know that a digital camera is a type of electronics. The self describing schema lets you add new types of products, categories, and attributes without redesigning your db. Also, this type of model lets you have sparse attributes without wasting space with null values.
You can store <id, property, value> triples in a three column table. On of those properties is a special "category" property that is transitive. You probably wan to precompute the closure since inferring it at query time involves an unknown number of SQL queries (or a recursive SQL query).
You could try an RDF store if you don't want to build the logic for this yourself (Something like Jena or Virtuoso). They generally take SPARQL queries (I think virtuoso will process SQL as well though).
I could babel about these types of data models for days, so just ask if there's some other info you want.
Excellent! I'll study these solutions and come back if something more is needed. Thanks a bunch.
I've been looking at ways of implementing RDF techniques into an RDBMS. In parallel I've been also looking into EAV models as a means to describe "property bags" (dynamic properties). In the context of both, I've also been looking at ways to use XML to my advantage.
I'm not happy with what I've been finding. As Perspective puts it, this crosses the limits of the relational model, which in the end means that no matter where I look, there will always be a series of disadvantages that either compromise recommended design choices and data integrity, and complicate design, or complicate code.
So this is my tentative solution to the problem. I'll start by describing what's the system for where this will be developed:
- MS SQL Server 2008 R2 (and above)
- .Net clients running on POS checkouts (Panasonic, Fujitsu and IBM, mostly)
- .Net clients running on Windows PCs (accessing Reporting Services and Analysis Services)
RDF: I'm having an hard time defining a RDF data storage system that I can fit into a RDBMS and I can then managed from my data access layer. I think the biggest problem is the difficulty I'm having in understanding the model.
XML: The problem being that, contrary to popular saying, XML isn't really self-describing, is it? It still depends on an higher layer to give meaning and properly validate the information therein. Basically, XML alone doesn't do me any good. As I was trying solutions around it I kept hitting the exact same walls of the RDBMS model.
I've conformed myself to an EAV model knowing I'll be losing on referential integrity and will be facing problems with type validation. The reasons I chose this are:
- SQL Server 2008 adds the SPARSE extension, which optimizes storage space on columns with a large number of NULL values (it simply doesn't store them). It also optimizes indexes on those columns (only non-null values are indexed).
- With this, I'm planning to make the Value table a full sparse matrix table with one column per data type and define the property data type as metadata in the Properties table. I'm not required to create one table per data type anymore.
- I'm not doing anything that is any worse than all other possible solutions. In the end, what I seem to perceive, is that no matter what I choose, this will always be a mess and a weak spot on the overall data model of the system. With a EAV solution, I'm at least doing something I'm used to (sticking to the relational data model).
I'm however far from happy. I'll keep finding better solutions to this problem. But for now, I really need to move on to other fights. Let me know if you think I'm missing something, or if you think this is really, really, a dumb idea.