Click to See Complete Forum and Search --> : GROUP BY over 30billion entries ?


Mark43
09-27-2006, 09:49 AM
Hi.
I have some trouble with my MySql Database. I have a table with about 30 billion URLs and i need the most popular urls (TOP100). But my sql statement needs hours.
How can I optimise my Statement or my Database?

select URL, count(url) as urlcount
from urltable
group by URL
order by urlcount;

Thanks for your help!
Mark

chazzy
09-27-2006, 09:58 AM
what indexes do you have on the table?

Mark43
09-27-2006, 10:33 AM
I tried an BTREE Index for the URLS. But it doesn't speed you my query.

chazzy
09-27-2006, 11:38 AM
what do you mean by a BTREE index?

I'm talking about a physical index on the table, created by using the CREATE INDEX statement.

Mark43
09-27-2006, 11:47 AM
I created an physical index like this:

create index urlindex USING BTREE on urltable (URL(100));

mattyblah
09-29-2006, 01:22 PM
sounds like you have a design problem. It also sounds like you are constantly entering in urls to this table. What I would do is break the table down into another table structured like:

url_id int autoincrement unclustered index
url varchar(255) clustered index
count int default 1

Then, when a new url is to be entered, check if the url exists. If it does, update the count + 1, otherwise enter in a new url with 1 as the count. Not sure how mysql would hand that but in sql server i'd write something like this:

if exists (select 1 from urltable where url = '$url')
update urltable set count = count + 1 where url = '$url'
else
insert into urltable (url) values ('$url')

again this is a combination of pseudocode and sql code but i think you should get it to work fairly easy.

chazzy
09-29-2006, 05:28 PM
I created an physical index like this:

create index urlindex USING BTREE on urltable (URL(100));

Creating a single column index typically has no effect. I would go w/ matty's recommendation to keep the count in a column and update it, if all you're doing is creating a table w/ the url column. the update will be faster.

Ultimater
10-03-2006, 07:00 AM
If you put Mattyblah's idea to use, your table would look something like this:

---------- url_popularity ----------
url_id count url
1 7 http://www.yahoo.com
2 1 http://www.blackforest.com
3 1 http://www.web-sniffer.net
4 3 http://www.msn.com
5 2 http://www.aol.com
6 14 http://www.google.com
7 1 http://www.httpguru.com
8 2 http://www.php.net

...

1000230420 1 http://www.myspace.com/ultimater

Then use his pseudosql to keep the table up-to-date:

if exists (select 1 from urltable where url = '$url')
update urltable set count = count + 1 where url = '$url'
else
insert into urltable (url) values ('$url')

However you will still be using GROUP BY on nearly as many entries as the first approach becuase of all the one-time occurring URLs.



I would suggest taking Mattyblah's idea to the next level and create yet another table in addition to his that will contain only the top 100 entries.

---------- top100_urls ----------
id count url
1 7 http://www.yahoo.com
2 14 http://www.google.com
3 4 http://www.webdeveloper.com

...

100 2 http://javascript.internet.com


By doing so, you are able to get the lowest count of the top 100 very quickly in order to determine whether or not a new entry will be eligible for the top 100.

To manage these tables, let's say a user wants to add http://www.webdeveloper.com to the database.
1. First it searches the url_popularity table if a record already exists with the given URL, if it exists then add one to the count otherwise create a new record for the URL.
2. Second search the top100_urls table if a record already exists with the given URL, if it exists then add one to the count.
3. Call the min() MySQL function for the top100_urls table in order to get the lowest "count" to beat in order to make the top 100.
4. If the new entry is eligible for the top 100, then add it to the top 100 list and then check if the table contains over 100 records, if it does, then remove the record with the lowest count.



However if users will be able to also remove URLs and cause the URL count of one of the top 100 to go down, it won't be so easy to manage the top100_urls table without major lag and my top100_urls table might be a bad idea.

mattyblah
10-06-2006, 03:03 PM
If you put Mattyblah's idea to use, your table would look something like this:

---------- url_popularity ----------
url_id count url
1 7 http://www.yahoo.com
2 1 http://www.blackforest.com
3 1 http://www.web-sniffer.net
4 3 http://www.msn.com
5 2 http://www.aol.com
6 14 http://www.google.com
7 1 http://www.httpguru.com
8 2 http://www.php.net

...

1000230420 1 http://www.myspace.com/ultimater

Then use his pseudosql to keep the table up-to-date:

if exists (select 1 from urltable where url = '$url')
update urltable set count = count + 1 where url = '$url'
else
insert into urltable (url) values ('$url')

However you will still be using GROUP BY on nearly as many entries as the first approach becuase of all the one-time occurring URLs.



I would suggest taking Mattyblah's idea to the next level and create yet another table in addition to his that will contain only the top 100 entries.

---------- top100_urls ----------
id count url
1 7 http://www.yahoo.com
2 14 http://www.google.com
3 4 http://www.webdeveloper.com

...

100 2 http://javascript.internet.com


By doing so, you are able to get the lowest count of the top 100 very quickly in order to determine whether or not a new entry will be eligible for the top 100.

To manage these tables, let's say a user wants to add http://www.webdeveloper.com to the database.
1. First it searches the url_popularity table if a record already exists with the given URL, if it exists then add one to the count otherwise create a new record for the URL.
2. Second search the top100_urls table if a record already exists with the given URL, if it exists then add one to the count.
3. Call the min() MySQL function for the top100_urls table in order to get the lowest "count" to beat in order to make the top 100.
4. If the new entry is eligible for the top 100, then add it to the top 100 list and then check if the table contains over 100 records, if it does, then remove the record with the lowest count.


However if users will be able to also remove URLs and cause the URL count of one of the top 100 to go down, it won't be so easy to manage the top100_urls table without major lag and my top100_urls table might be a bad idea.

Not sure how my idea would use a group by clause in it. All you would have to do is something like:

SELECT TOP 100 * FROM table ORDER BY [count] DESC

and that would return the top 100 urls. Perhaps we're not understanding each other. If it was still a problem with speed you could create another table that holds the top 100 that was updated every hour with a cron job. That could also work.