# Thread: Matching data to users

1. Hi everybody,

I'm building a website in which I want to match events (concerts, fun things to do in the area, etc.) to users. These events should be constrained to a location at the country, state, or city level. For example, one event might only be interesting to people who live in one city, whereas another event might be interesting to anybody in the same country (for big events).
Additionally, these events should be listed in order of popularity (which will be estimated from the number of comments and clicks). Finally, there should be a random factor involved so a user doesn't see the same events every time.
Note that the results don't have to be exact; a good heuristic is fine.

My questions:
1. What are some good algorithms for this purpose?
2. Is there any application I can use to do this efficiently?

I'd prefer it to be extensible: I'd rather not have to rewrite everything should I want to add more matching parameters in the future.

So far I'm using only PHP/MySQL, though I'm not sure if that's enough to do this efficiently.

2. I would just start with assigning each event a score of 1.0, and then, for each factor you want to consider, multiply the score by some multiplier.

E.g. for big events: same city = 1.0, different city = 0.5, different country = 0. And for small events: same city = 1.0, else 0. The multiplier can even be continuous, based on how far the person would have to travel, e.g. max(0, 1.0 - distance_in_miles / 100).

Popularity can also be a multiplier, for example 0.3 + min(clicks / 1000, 0.7). There can be another multiplier for similarity to events the user was interested in in the past.

The events can then simply be sorted by final score, and a few picked from the top.
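A minimal sketch of the multiplier scheme above, in Python for illustration. All field names, constants, and the size of the random factor are assumptions, not part of the original suggestion:

```python
import random

def location_multiplier(event, user):
    """Big events reach the whole country; small ones only the same city."""
    if event["scope"] == "big":
        if user["city"] == event["city"]:
            return 1.0
        if user["country"] == event["country"]:
            return 0.5
        return 0.0
    return 1.0 if user["city"] == event["city"] else 0.0

def popularity_multiplier(event):
    # 0.3 floor keeps unpopular events from being wiped out entirely.
    return 0.3 + min(event["clicks"] / 1000, 0.7)

def score(event, user):
    s = 1.0 * location_multiplier(event, user) * popularity_multiplier(event)
    # Small random jitter so the user doesn't see the identical list every time
    # (the +/-20% range is an arbitrary choice for this sketch).
    return s * random.uniform(0.8, 1.2)

def top_events(events, user, k=10):
    # Score every event for this user, sort, and keep the top few.
    return sorted(events, key=lambda e: score(e, user), reverse=True)[:k]
```

Each new matching parameter is just one more multiplier function, which keeps the scheme extensible in the sense the original poster asked for.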

3. > So far I'm using only PHP/MySQL, though I'm not sure if that's enough to do this efficiently.
I'm not a PHP fan, but I've done some work with a MediaWiki-based site (MediaWiki is the software that runs Wikipedia; it's written in PHP and uses MySQL), and that involves quite a bit more complexity than what you're talking about. I think load and hardware are the more serious factors, performance-wise.

4. Thanks for your replies guys!

But my issue with a multiplier is that the score has to be calculated for each row in the table, for each user. If the number of rows and/or the number of users increases drastically, I can imagine the performance becoming terrible. If I'm not mistaken, that's on the order of O(n log n) per query, n being the number of events. I'd rather decrease the time complexity somewhat, if possible.

That is what I meant about PHP/MySQL not being up to the job. I simply think a relational database won't be the best solution for this. Sure, I could write some sort of PHP daemon to do just this, which might be fast enough, but even in that case I need a decent algorithm.

What I had in mind is something like Apache Solr. I'm not sure whether Solr actually supports what I want; I haven't really used it (I looked at it a couple of years ago), but it seemed to support some quite complicated scoring techniques. That may have been O(n log n) as well, but judging from the speed of the queries I'd say it must use some heuristic.

So, while I appreciate your answers, and I'm no longer ruling out going for that (or a similar) solution, I do wonder: is there a "faster" solution using some clever heuristic? Preferably already implemented ;-).

Thanks!

5. O_o

Yes, something like "Apache Solr" may do this better than a relational database.

No, you aren't going to make a "PHP daemon" that offers more performance.

You should only need to employ normalized "weights" to get a good relational database engine to do this well. That means normalizing incrementally instead of trying to do it on demand. With this approach the "weighting" of events is handled by virtue of sorting, without any "multiplication" needed in the application.

So, for example, say you have a list of major cities, zip codes, or some other natural relational quantity, a set of events, and a handful of users. You wouldn't have a single flat table for each of these things; you'd have a heavily normalized database.

Say there's a generic location table with a sorted index built from relative-distance "weights", a "Music Events" table with a sorted index built from the popularity "weight", and a per-user table holding "weights" for relative distances and event preferences. The values in the tables are adjusted by their weights, normalized, and written back incrementally.

A "range query" on a modern database engine is then faster than anything you are likely to build specifically for this purpose: the database only needs to traverse a few ordered indexes to return the events most relevant to a given user.
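A minimal sketch of the precomputed-weight idea, using SQLite via Python purely for illustration (the schema, column names, and scoring are assumptions, and a real setup would use more tables): the popularity weight is stored in a column and kept current incrementally, a sorted index covers it, and the per-user request becomes an indexed range scan instead of per-row arithmetic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE event (
        id INTEGER PRIMARY KEY,
        city TEXT,
        scope TEXT,              -- 'big' or 'small'
        popularity REAL          -- normalized weight, updated incrementally
    );
    -- Sorted index: the engine can return events in weight order
    -- without computing anything at query time.
    CREATE INDEX idx_event_rank ON event (city, popularity DESC);
""")

conn.executemany(
    "INSERT INTO event (city, scope, popularity) VALUES (?, ?, ?)",
    [("Berlin", "small", 0.9), ("Berlin", "small", 0.4),
     ("Paris", "small", 0.8)],
)

# Per-user request: a range scan over the user's city, already ordered
# by the precomputed popularity weight -- no per-row multiplication.
rows = conn.execute(
    "SELECT id, popularity FROM event "
    "WHERE city = ? ORDER BY popularity DESC LIMIT 10",
    ("Berlin",),
).fetchall()
```

The work moves to write time (recomputing a weight when clicks or comments change), which is exactly the "incremental normalizing instead of on demand" trade-off described above.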

Soma