Reconciling assholes for nearly a decade.

Corpus Cristi -or- Database Searching

So there's this table with about a million records, and a dozen nvarchar or ntext fields. A user wants the ability to enter any text snippet and have any record with that value anywhere in any part of the values of any of those columns as a result. This text snippet isn't necessarily a whole word, nor is it necessarily the beginning of the word (e.g. they want 05025 to pull up the record where the guy mentions record BG20050910052576), which rules out Full-Text Search being useful (in SQL Server). One is relegated to doing terrible, full table scanning LIKE comparisons with every column of every record. Terrible.

In that scenario, does anyone know of any technological solutions? Any indexers of partial word fragments that one should look at in this scenarios? I'm going to look at the Lucerne engine, but my gut instinct and pedestrian knowledge of full-text indexing tells me that there's going to be little consolation.

Oh, and no I didn't design this.
Permalink Dennis Forbes 
August 30th, 2005
Lucene would have been my suggestion. It think it does support pattern searches, not just keyword, so you should be ok.
Permalink son of parnas 
August 30th, 2005
Lotus Notes full text search has wildcard operators, and it used the Verity search engine. http://www.verity.com/
Permalink ronk! 
August 30th, 2005
Add a text field, call it "fulltext" or something.
Add a trigger on the record, set to trigger on inserts or updates. In the trigger concatenate the fields into the fulltext field.

Run your queries on the fulltext field.

Philo
Permalink Philo 
August 30th, 2005
I'm not sure fulltext indexes would cover this case:

"they want 05025 to pull up the record where the guy mentions record BG20050910052576"

Do they?
Permalink Almost H. Anonymous 
August 30th, 2005
RTFP, you idiot, Philo.
Permalink Anon 
August 30th, 2005
Wow. Talk about thick-headed.

You last two, I mean.

I'm sorry I arbitrarily picked the term "fulltext" - call it "ConcatText"

This has nothing to do with full text searching - I'm talking about creating a field which has all the fields of interest that you can then run the query (even if it has to be a LIKE '%' + SEARCHTERM query) against.

Philo
Permalink Philo 
August 30th, 2005
>> So there's this table with about a million records

Again, RTFP, you idiot, Philo.
Permalink Anon 
August 30th, 2005
>>>> So there's this table with about a million records
>>Again, RTFP, you idiot, Philo.

So what? 1 million records is not a lot for an indexed table.
Permalink Masiosare 
August 30th, 2005
Like %Searchterm doesn't use indexes.

But then, I'm noting that "Anon" isn't coming up with anything better.

I have to wonder if he sits in team meetings and every time someone suggests a solution he says "No, that's stupid and won't work, you fucking moron"

Philo
Permalink Philo 
August 30th, 2005
Dennis, you in a position to give it a try and let us know what the response time is like?

Philo
Permalink Philo 
August 30th, 2005
>> I have to wonder if he sits in team meetings and every time someone suggests a solution he says "No, that's stupid and won't work, you fucking moron"

Hey you must be boss, 'cause that's exactly what i do! ...you fucking moron!!
Permalink Anon 
August 30th, 2005
>> Dennis, you in a position to give it a try and let us know what the response time is like?

No, that's a stupid idea and won't work, you FUCKING MORON!!
Permalink Anon 
August 30th, 2005
+1 to Philo, which I guess makes me a moron.

Messy problems don't always have elegant solutions.
Permalink Christopher Hester 
August 30th, 2005
> Messy problems don't always have elegant solutions.

Messy problems ALWAYS have elegant solutions, you MORON!!!


> +1 to Philo

Philo isn't in the right mathematical domain to be incremented by 1 (unless maybe you're a Perl programmer), you MORON!!!

;P
Permalink Tayssir John Gabbour 
August 30th, 2005
i like philo's solution, too.

+1 moron here.
Permalink Kenny 
August 30th, 2005
BTW, this is a beautiful example of justification for triggers. I *hate* triggers - I avoid them like the plague; but I can't think of a better way to do this.

Philo
Permalink Philo 
August 30th, 2005
I hate hate hate triggers too. How much more expensive is this?

SELECT
...
WHERE
COLUMN1 LIKE '%' + @TERM + '%' OR
COLUMN2 LIKE '%' + @TERM + '%'
Permalink Jeff Barton 
August 30th, 2005
Philo's suggestion is a good one. Doing the "like" search on one column will be better than all the other columns individually.

One million rows is a trivial amount for SQL Server, assuming you're not running on a Pocket PC ;-)

Only optimization would be to split the concatenated field into a separate table, so that the search query wouldn't have so much data to trawl through. With the column in the main table, SQL will have to table scan about twice as much data (as the data is stored twice).

These extra I/Os may be material, but with only one million records, it shouldn't be an issue.
Permalink New York ??? 
August 31st, 2005
> does anyone know of any technological solutions?

A hammer. Use it to beat the user with it.
Permalink  
August 31st, 2005
"One million rows is a trivial amount for SQL Server, assuming you're not running on a Pocket PC ;-)"

If we were talking about an index-backed operation, then I'd agree with you - a million rows isn't a big deal.

However, when you're dealing with rows with mixed nvarchar and ntext columns, with an average row size (including the ntext) of about 3KB+, then a million rows is a _HUGE_ amount. Table scanning, and comparing against, 3GB continually is a non-trivial task even on very large scale servers. Even if it were smaller, it would still be a bitch if you had a heavily loaded server and then someone comes along to table scan a million rows. Something like this gives me the willies at the outset, and I get upset and in tithers if I see table scans of multi-hundred record tables, much less million record tables.
Permalink Dennis Forbes 
August 31st, 2005
But Dennis, based on the problem as stated, the ONLY solution is to scan the data - I don't *think* there's any kind of magic bullet that will perform fuzzy substring searches any other way?

Philo
Permalink Philo 
August 31st, 2005
That's pretty much what I'm trying to determine, Philo. Presumably computer science is advancing, and if there is a form of indexing that would allow partial mid-string matches (for instance an indexer could MASSIVELY trade storage for speed and store partial fragments of every word. e.g Philo is indexed against Philo, hilo, ilo, lo, and o), then I'd like to know before I declare it an impossible to optimize task.

As a partial theoretical solution, end of string matches (e.g. "REC200508310059" could be matched by 0059) should be a configurable in full-text indexing where it indexes each "word" both forwards and backwards, and a CONTAINS against *0059 would reverse the search phrase and search on the reverse set. Unfortunately full-text indexing doesn't allow this (apart from someone hacking out a reversed column for each of their fields and indexing that as well, which is the sort of nonsense manual task that is hacksville).
Permalink Dennis Forbes 
August 31st, 2005
In my drunken stupor Philo's concatenated column seems like a reasonable answer -- the database design is obviously borken from the off, so a crude btu funtional hack is probably the best you can hope for. I've dealt with similarly retarded databases before (even so far as to the requirement being searching any field in any table) and the answer has always been ugly...
Permalink Mat Hall 
August 31st, 2005
BTW, thought about this on the flight home - stick the concat field in another table (as suggested) and design an async process to do the search, returning batches of results, but cancellable by the user.

This solution is really appealing if the requirement is non-negotiable and you're paid by the hour...

Philo
Permalink Philo 
September 1st, 2005

This topic was orginally posted to the off-topic forum of the
Joel on Software discussion board.

Other topics: August, 2005 Other topics: August, 2005 Recent topics Recent topics