I recently developed an algorithm that can be used for average rating calculations in content rating systems. It allows an average rating to be re-calculated upon a new vote only with simple mathematical operations, without the previous ratings having to be retrieved one-by-one from a database, summed up and then divided by their total number like in most traditional solutions based on the standard arithmetic mean equation.
Hmm...do you have any empirical evidence that all of that is faster than simply doing:
Code:
SELECT
AVG(rating) AS avg_raging,
other_cols_of_interest_here
FROM table_name
WHERE some_id = <some_value>
GROUP BY some_id
(assuming some_id is indexed)
"Please give us a simple answer, so that we don't have to think, because if we think, we might find answers that don't fit the way we want the world to be."
~ Terry Pratchett in Nation
Yes, I did a quick benchmark test on a small data set (12 rows). What I did was, insert a new grade to the database for a corresponding ID, then re-calculate the new average rating for that corresponding ID and output it to the screen.
Using MySQL AVG(): Finished in 0.001843 seconds
Using my algorithm: Finished in 0.001113 seconds
0.0007 seconds of difference and we're dealing with only 12 rows, plus the database server and the web server are located on the same machine.
That's without retrieving the total number of ratings forming the average rating for the given ID. If I wanted to do that, I would have to execute a statement like this:
Code:
SELECT
AVG(rating) AS avg_raging, COUNT(rating) as votes_count
FROM table_name
WHERE some_id = <some_value>
GROUP BY some_id
In which case the benchmark results on the same data set are as follows:
Using MySQL AVG(): Finished in 0.001865 seconds
Using my algorithm: Finished in 0.001128 seconds
0.0007 seconds may seem like a tiny difference on small data sets, but imagine you're dealing with a popular website, with a few hundred millions of unique visitors per month..
0.0007 seconds may seem like a tiny difference on small data sets
It may appear just the variable mistake of experiment (delay because of some hardware interrupt, for example), so you need to test on large amount of data to obtain any meaningful comparison. It is of little use to compare times less than quantum of system time allowed for each process (which is usually 30-50ms) so you'd better try to generate data sets which took about 1 second to process query on them...
Well, I just finished a benchmark script that allows you to test with large data sets. I ran a test on a data set of 100 000 rows, then on 1 000 000 rows, and my algorithm is still the fastest.
Here is a link if you're interested in running it from your server or if you just want to check the source code of the benchmark, so you can see how it works: Algorithm Benchmark
And here is that very same benchmark script running on a 1 000 000 record data set on my server, in case you want to do a quick online test: Online Algorithm Benchmark
Bookmarks