Click to See Complete Forum and Search --> : 256 or 4096 tables?
Znupi
02-01-2008, 06:11 PM
For search speed reasons, I want to split a multi-million-row table into many tables postfixed with two or three hexadecimal digits. This would mean to split it into 256 or 4096 tables. Obviously, I would obtain better speeds with 4096 tables (considering I know precisely in which table I have to search), but it sounds a bit scary to have thousands of tables in a database (never did it before :-s). All these tables would also have indexes on 3 of their 4 numerical columns.
Can you give me any piece of advice (pros/cons) regarding the two options? I expect the configuration to work with a few tens of millions of rows.
Also, can you give me any idea about the disk space overhead in such situations?
Thanks :)
PS: I'm using MySQL, if that matters.
chazzy
02-01-2008, 06:38 PM
I'm not sure how you see that splitting it up over all these tables is actually going to gain you anything in performance. you're still limited to the capacity of the server it's running on.
Znupi
02-01-2008, 06:47 PM
I'm not sure how you see that splitting it up over all these tables is actually going to gain you anything in performance. you're still limited to the capacity of the server it's running on.
The gain in performance is pretty obvious while I'll be searching in a table of a thousand rows instead of a table with millions of rows. The search speed is my primary concern; storage, while it presents interest, is secondary.
felgall
02-01-2008, 09:15 PM
If you know the key of the record you are looking for in the database then the database will find the fastest way to access it. You would do better to add indexing on the appropriate fields to allow the lookup to go faster. Splitting the table will not have any speed impact unless the table isn't properly indexed in the first place.
chazzy
02-01-2008, 10:13 PM
Znupi, I think you're discrediting what an index does, and maybe even under estimating it. Maybe I'm reading your first post wrong, but you write indexes on 3 of their 4 which has me thinking. Does he mean #1 or #2?
#1
create index my_index on table1(col1,col2,col3);
#2
create index my_index_1 on table1(col1);
create index my_index_2 on table1(col2);
create index my_index_3 on table1(col3);
Where #1 improves performance of queries like this:
select columns from table1 where col1 = value and col2 = value and col3 = value
and #2 improves the following query style:
select columns from table1 where col1 = value or col2 = value or col3 = value
Obviously value can be anything even in the separate iterations.
Also, you could create all 4 indexes, which would speed up the select on the where clause w/ all 3 items entered, any single one entered.
But either way, I think you should be looking more in to how indexes work before you go ahead and create so many tables.
Znupi
02-02-2008, 08:03 AM
My current multi-million table is properly indexed on the columns I perform my searches. Also, the 256 or 4096 also would have proper indexes on the same columns.
The index mechanism drastically improves the search speed on tables, but this speed is still dependent on the table size. If we take two identical tables, with the same columns and indexes but with totally different number of rows and perform a search on one of it's indexed columns, the smaller table will always perform better. The search I am talking about and using is a simple one, performed after a single numerical non-unique indexed column.
Probably what I should have made more clear is that I know prior to the search itself the exact table (from the 256 or 4096) in which I will find my results.
felgall
02-02-2008, 03:08 PM
The time it will take in the code you wrap around the database call to select which of the tables to do the lookup in will take at least as long if not longer than the extra time required for the single table lookup since the index lookup is effectively already using the best way of handling working out which section of the index the wanted results are in. Overall taking your extra code to handle the multiple tables into account the time for the call using multiple tables will be the same or longer than the one table version.
What you are talking about doing is not reducing the volume of data to be searched to find the wanted results. You are looking at taking part of the lookup away from the database and writing your own code to handle that part of the search before calling the database. Assuming you write the most efficient code possible for that you will end up with the same lookup times as if the data were all in one table with the data just being harder to maintain.
Znupi
02-03-2008, 06:12 AM
What you are talking about doing is not reducing the volume of data to be searched to find the wanted results. You are looking at taking part of the lookup away from the database and writing your own code to handle that part of the search before calling the database. Assuming you write the most efficient code possible for that you will end up with the same lookup times as if the data were all in one table with the data just being harder to maintain.
I understand very well what you are saying and that you think that I'm trying to re-invent the wheel. In a way you are right, I am trying to take a part of the lookup from the database, but there are certain situations in which this can be done in a very simple and efficient manner.
Let's say you have a simple table called "dictionary" with two fields: "word" and "definition". The only operation you what to perform on this table is searching a definition (by word). Now let's say you split this table in "dictionary_a", "dictionary_b", etc, having all the words starting with "a" in the "dictionary_a" table and so on. Now, the overhead of determining prior to the search itself the first letter of the word you are searching for is minimal (and, in my case, performed on another machine). So, the search itself will be performed in a much smaller table.
I know that this is not a very good example and that in this case SQL's indexes would do all this splitting for you, but please believe me that there are situations when implementing some kind of efficient pre-indexing taking in consideration a particular application and database structure is common practice while handling very large amounts of data. SQL's indexing system, though very efficient, is a general solution and thus it doesn't take advantage of the particularities of a given project.
felgall
02-03-2008, 12:52 PM
The database will go one better than your split by splitting the content into exactly equal sections for the lookup rather than the unequal chu8nks you propose to use based on starting letters. The database lookup will therefor be more efficient than yours.
If the SQL indexing system is not doing the lookup based on the spread of data that you have then use the appropriate database tools to tell the database how the data is actually distributed so that it can choose the most efficient process. The database only uses the general rules if it doesn't have additional information to see how to improve on it.
The whole point of relational databases is that if you tell the database how you expect the data to look then it can come up with the most efficient method possible for accessing it.
FeelLikeANut
02-03-2008, 10:42 PM
I know that this is not a very good example and that in this case SQL's indexes would do all this splitting for you, but please believe me that there are situations when implementing some kind of efficient pre-indexing taking in consideration a particular application and database structure is common practice while handling very large amounts of data.I can't think of a situation where indexing would be insufficient either. But now I'm really curious. Can you tell us what your situation is, Znupi?
felgall
02-03-2008, 11:45 PM
That is the sort of action that is taken by someone who doesn't properly understand relational databases. Anything that you take out of the database and hard code your own code for will need to always work that way. Provided that you configure the database correctly you can get the database itself to internally perform the same lookup processing. If the spread of data changes significantly you can tell the database about the change and it will reconfigure the way it handles the lookup rather than your having to rewrite the code you stuck on the front end as well as redesigning the database.
FeelLikeANut
02-04-2008, 10:21 AM
That is the sort of action that is taken by someone who doesn't properly understand relational databases.That may be so, but you don't help people learn by telling them they're stupid. I think chazzy is a good example to follow here. He has a way of saying "I think you're wrong" without coming off as hostile or demeaning.
Znupi
02-04-2008, 03:54 PM
The database will go one better than your split by splitting the content into exactly equal sections for the lookup rather than the unequal chu8nks you propose to use based on starting letters. The database lookup will therefor be more efficient than yours.
You are absolutely right, this is exactly how I was planning to do it. That was just for the sake of giving a simple example. As I was mentioning in my initial post, I will split my table “into many tables postfixed with two or three hexadecimal digits“, these digits being the first ones of the MD5 hash of the words, which will assure a more equally distribution of the data among these tables.
can't think of a situation where indexing would be insufficient either. But now I'm really curious. Can you tell us what your situation is, Znupi?
I think we all agree that a search engine doesn't hold all of its results in a huge table, relying on SQL indexes for search speed results. I know a little bit about distributed databases, but, at least for now, I have reasons not to begin working on deploying such a system. But there are many ways of “distributing” (as a more general meaning), on of them being the SQL indexing system itself. You just have to think a little bit outside the box to realize that it is not the only one, and that it's performance can be improved through certain actions for a particular application.
That is the sort of action that is taken by someone who doesn't properly understand relational databases
I have a master degree in Computer Science, a major in Artificial Intelligence and plenty of experience with large scale web-based applications. I also know a little bit about the internal structure of a database and the rules that govern a database engine, including indexing. But if you say I don't understand a relational database, probably you are right. I also think that further discussions on this particular offtopic subject would be irrelevant.
This has become an interesting debate and will continue posting here in support of my opinions, but I would very much appreciate an answer regarding my initial post. Thank you.
FeelLikeANut
02-04-2008, 05:12 PM
But there are many ways of “distributing” (as a more general meaning), on of them being the SQL indexing system itself. You just have to think a little bit outside the box to realize that it is not the only one, and that it's performance can be improved through certain actions for a particular application.But what you're proposing isn't distributing in any useful way. Your entire database is still using the same CPU, the same RAM, and the same hard disk. As Chazzy said, "You're still limited to the capacity of the server it's running on."
The beauty of an index is that it lets you find rows in O(1) time, or at worst O(log n) time, depending on the implementation (tree versus hash). Being a computer science major, I'm sure you know that O(1) time means the lookup happens in constant time regardless of the size of the data structure. I'll once again quote Chazzy: "I think you're discrediting what an index does, and maybe even under estimating it."
Znupi
02-04-2008, 06:26 PM
The beauty of an index is that it lets you find rows in O(1) time, or at worst O(log n) time, depending on the implementation (tree versus hash). Being a computer science major, I'm sure you know that O(1) time means the lookup happens in constant time regardless of the size of the data structure. I'll once again quote Chazzy: "I think you're discrediting what an index does, and maybe even under estimating it."
First of all I want to say I appreciate your comment. It is insightful and still you are not patronizing.
If I recall correctly, hash indexes are extremely memory consuming. Quoting dev.mysql.com, "If a table fits almost entirely in main memory, the fastest way to perform queries on it is to use hash indexes". It is very hard to keep a hash and perform searches using it on a huge table in real-life conditions. If I am wrong, please advice me. I would obviously be interested in a O(1) search but that just feels very hard to achieve with regular computing resources on a multi-million records table.
On the other hand, the regular, more memory-friendly tree index implementation has, as you said, O(log n) complexity. Unfortunately, despite the "log" generosity, "n" can become a problem for big tables. Again, if I recall correctly, "n" being the number of levels in the tree, it is directly proportional with the number of records in the table. All I am trying to do is cutting this "n" down, to a much more smaller value, using in a way a pre-indexing hash-based method prior to the search itself.
FeelLikeANut
02-04-2008, 09:43 PM
If I recall correctly, hash indexes are extremely memory consuming. Quoting dev.mysql.com, "If a table fits almost entirely in main memory, the fastest way to perform queries on it is to use hash indexes".Well, as the quote says, keeping it in memory is the fastest way, but that doesn't mean it's the only way. A hash can be maintained on disk just as well as in memory, albeit slower because of disk accesses.
I would obviously be interested in a O(1) search but that just feels very hard to achieve with regular computing resources on a multi-million records table.If you really want to see how well it works, test. I'm sure your company or organization must have a sandbox (non-production) environment where you can play around with things. Run some SQL queries on unique or otherwise indexed values (a primary key is sure to be indexed), and benchmark the process. Then implement your split tables idea, and benchmark the same query.
if I recall correctly, "n" being the number of levels in the tree...The number of levels in the tree is actually log n.
All I am trying to do is cutting this "n" down, to a much more smaller value, using in a way a pre-indexing hash-based method prior to the search itself.The question is whether this pre-indexing implementation will be faster than the database's native indexing implementation. B-trees are balanced, so after the first comparison, the number of keys to check is already cut in half. And if the database is using a hash rather than a tree, then it almost certainly will be faster than a high-level pre-indexing algorithm.
Or at least that's what I and others here seem to think. Benchmarking some trials should give you a more definitive answer.