I have struggled for months with this one. I originally placed all of the Spanish into the database with the accent characters in HTML-entity form.
For example:
Code:
¡Me encanta la fotografía!
This would correctly display the characters on screen, but the word "fotografía" could not be matched in a select query (with the possible exception of actually typing the HTML-entity).
I wanted everything to be in UTF8 format, so I then tried converting the entire database to this, replacing all of the HTML-entities with their UTF8 equivalents. Now I can search, and find, the Spanish accented words, but ONLY if the accents are entered into the query.
For example, if the word is "Jesús", the un-accented form "Jesus" will not match, but "Jesús" will.
I have heard there is a way to ignore accents, similar to ignoring case.
How can I properly set this up?
I _VERY MUCH_ appreciate any helpful hints on this.
I've been using queries, via my perl script, of this form:
Code:
SELECT * from Database.Table WHERE TextColumn LIKE "%keyword%" LIMIT 0, 10;
Part of my issue here is that the script needs to handle multiple languages. I have multiple tables in the database, one for each language: English, Spanish, Chinese, Thai, etc. Unfortunately, I don't think I can use a FULLTEXT search on the Asian languages. Also, in order to handle multiple languages equally, the database is setup to do everything in UTF8.
So, is this accent-insensitive search only going to be available when using fulltext? Is there a way of making it work with an ordinary LIKE statement, or perhaps with RLIKE?
Spanish is the only database I'm having issues with right now, and I'm hesitant to make major changes to the search queries when all of the other languages are operating smoothly.
As I understand, using a FULLTEXT search is not all that similar to doing a LIKE search in several aspects: 1) Ordering of results, 2) Wordlengths, and 3) System requirements.
I'm still looking for an answer to this. I've asked a few of my expert friends, who said they'd get back to me, but never did...so I presume they did not find the answer either.
There are a few other reasons why I am not doing a fulltext search, and thus cannot consider it as a "best option" in my case. Is there no way to do an accent-independent search in MySQL without fulltext?
Most of the databases today have the "Collation" option in which you define your database collation in certain "language oriented", Case Sensitive/Insensitive and Accent Sensitive/Insensitive.
Defining your database (all), or only a table with a desired collation will allow you to query data without having to deal with accents or case. Example:
My database is MS SQL 2005 and the collation of my database is defined as:
Latin1_General_CI_AI
This definition is made at the time of database creation, or later your can change properties of database or only a table with an specific collation.
Latin1_General_CI_AI collation gives you the chance to have latin caracters and include most of the latin and european languages (excluding japanese, russian and some others with very funky characters).
CI means Case Insensitive and AI means Accents Insensitive
When creating a database or changing the collation you are given multiple options of Latin or other character sets.
To change the collation of a table, set Extended Properties of the table and change the collation.
In case of MS SQL, ther is no way to define specifically UTF-8 set since database is multi-character capable (that says Microsoft). However, if you store the data from a page defined as utf-8, data will be saved into table as UTF-8 and when retrieved will be utf-8 as well. This is a tricky thing since data can be utf-8 but retrieved in a page with a different charset, SQL will convert.
FYI: ASP language on IIS does not handle utf-8 well, sometimes display data as charset=iso-8859-1 or so, so funny chars are shown. MS says utf-8 is not supported. You can use it but half of the time will sitch data to funny chars.
ASPX is other stoy, is fully functional with UTF-8 as well as php and other languages.
If you use Postgre or MYSQL database, both have this collation types witha case and accent sensitive/insensitive and full definition for utf-8.
In short, defining collation with AI and CI collation, you only need to query your database and accents and case will not be taken in consideration, this means user can type something with or without accent. Example:
User type: ídea and results come with records for: Idea, idea, ídea
Are you sure that MySQL supports these *_AI character sets?
To be frankly honest, I'm taking a hard look at non-MySQL solutions right now, because MySQL is behind the times on the basics like this. Another feature I need that it lacks is that of word proximity searching.
I'm just tired of dealing with this. I have only one possible solution, it appears, with MySQL: create a column for searching in which ALL of the accents have been replaced with non-accented characters, and then return the search results from a separate column where the original text has been preserved. But that means double the data space, and more mess to deal with.
Thank you for taking the time to share the possibility. I just haven't seen it be a reality for MySQL.
Bookmarks