Thread: Matching data to users

  1. #1
    Registered User
    Join Date
    Oct 2008
    Posts
    1,262

    Matching data to users

    Hi everybody,

    I'm building a website in which I want to match events (concerts, fun thing to do in the area, etc.) to users. These events should be constrained to a location on either a country, state or city level. For example, one event might only be interesting to people who live inside one city, whereas another event might be interesting to anybody in the same country (for big events).
    However, these events should be listed in order of popularity (which will be a guess based on the number of comments and clicks). Finally, there should be a random factor involved to prevent the person from seeing the same events every time.
    Note that the results don't have to be exact, a good heuristic is fine.

    My questions:
    1. What are some good algorithms for this purpose?
    2. Is there any application I can use to do this efficiently?

    I prefer it to be extendible: I'd rather not have to re-write everything should I want to add some matching-parameters in the future.

    So far I'm using only PHP/MySQL, though I'm not sure if that's enough to do this efficiently.


    Thanks in advance

  2. #2
    Registered User
    Join Date
    Dec 2006
    Location
    Canada
    Posts
    3,229
    I would just start with assigning each event a score of 1.0, and for each factor you want to consider, multiply it by some multiplier.

    Eg. For big events, same city = 1.0, different city = 0.5, different country = 0. And for small events, same city = 1.0, else 0. It can even be continuous based on how long the person would have to travel (1.0 - distance_in_miles/100).

    Then popularity can also be multiplier. For example, 0.3 + min(clicks / 1000, 0.7). There can be another multiplier for similarity to events the user was interested in in the past.

    Then the events can simply be sorted by final score, and picked from top few.

  3. #3
    spurious conceit MK27's Avatar
    Join Date
    Jul 2008
    Location
    segmentation fault
    Posts
    8,300
    So far I'm using only PHP/MySQL, though I'm not sure if that's enough to do this efficiently.
    I'm not a PHP fan, but I've done some work with a mediawiki based site (the software that runs Wikipedia), which mediawiki/Wikipedia is written in PHP and uses MySQL, and that involves quite a bit more complexity than what you're talking about. I think load and hardware are the more serious factors, performance wise.
    C programming resources:
    GNU C Function and Macro Index -- glibc reference manual
    The C Book -- nice online learner guide
    Current ISO draft standard
    CCAN -- new CPAN like open source library repository
    3 (different) GNU debugger tutorials: #1 -- #2 -- #3
    cpwiki -- our wiki on sourceforge

  4. #4
    Registered User
    Join Date
    Oct 2008
    Posts
    1,262
    Thanks for your replies guys!

    But my issue with a multiplier is that the score has to be calculated for each row in the table, for each user. If the number of rows and/or the number of users increases drastically, I can imagine the performance becoming terrible like that. It would be in the order of O(n log n), n being the number of events, if I am not mistaken. I'd rather decrease the time complexity somewhat, if possible.

    That is what I referred to about PHP/MySQL not being up for the job. I simply think a relational database won't be the best solution for this. Sure, I could write some sort of PHP daemon to do just this, which could be fast enough, but even in that case I need to know a decent algorithm.

    What I had in mind is something like Apache Solr. I'm not sure if Solr actually supports what I want or not, I haven't really used it (I looked at it a couple of years ago), but it seemed to support some quite complicated scoring techniques. Now that may have been O(n log n) as well, but judging from the speed of the queries I'd say it must use some heuristic of sorts.

    So, while I appreciate your answers, and I am suddenly not ruling out going for that (or a similar) solution, I do wonder: is there a "faster" solution using some clever heuristic? Preferably already implemented ;-).


    Thanks!

  5. #5
    Master Apprentice phantomotap's Avatar
    Join Date
    Jan 2008
    Posts
    5,108
    O_o

    Yes, something like "Apache Solr" may do this better than a relational database.

    No, you aren't going to make a "PHP daemon" that offers more performance.

    You should only need to employee normalized "weights" to get a good relational database engine to do this well. That means doing incremental normalizing instead of trying to do this on demand. With this approach the "weighting" of events is calculated by virtue of sorting without any "multiplication" needed in the application.

    So, for example, say you have a list of major cities, zip codes, or some other natural relational quantity, a set of events, and a handful of users. You wouldn't have a single table for each of these things. You must have a heavily normalized database. Let's say we have a generic location table with a sorted index built from relative distance "weights". Let's say we have a "Music Events" table with a sorted index built from the popularity "weight". Let's say we have a table (for each user) with a set of "weights" for relative distances and event "weights" by preference. The values for the tables are mutated by their weights, normalized, and then added back to the tables. A "range query" on a modern database engine is now faster than anything you are likely to build specifically for this same purpose. The database only needs to traverse multiple ordered lists to return the events most relevant to a given user.

    Soma

  6. #6
    Registered User
    Join Date
    Oct 2008
    Posts
    1,262
    Thanks for your answer!

    Of course I'm not going to make a PHP daemon for this purpose, that was in reply to MK27's comment that PHP can be fast enough for this just fine. Of course I could implement any such algorithm in PHP, and performance would not be an issue. But I'd rather not, as I think there are easier and better solutions available. The fact that it can be done doesn't mean that I think it's a good idea though, and I never thought it was (I may have been unclear there).

    I'll need to have some more thoughts on your idea. I'm not completely sure I understand, but I think I understand where you're going with this. I'll get back to this if I have any more issues with this.


    Thanks for your help, guys!

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. matching one data type to another
    By nuphonic in forum C++ Programming
    Replies: 6
    Last Post: 05-07-2007, 10:57 PM
  2. Matching user info with data in file
    By RVDFan85 in forum C++ Programming
    Replies: 2
    Last Post: 12-12-2006, 12:40 AM
  3. Users saving data on a text editor
    By htmlmaster in forum C++ Programming
    Replies: 3
    Last Post: 04-17-2005, 06:20 AM
  4. reading data from the users input
    By SpEkTrE in forum C++ Programming
    Replies: 1
    Last Post: 04-10-2005, 06:15 PM
  5. pattern matching
    By ahahplz in forum C Programming
    Replies: 5
    Last Post: 02-07-2003, 07:15 PM