======================================================================= SEARCHING TEXT COLUMNS ======================================================================= PRODUCT: R:BASE VERSION : 3.1A Or Higher ======================================================================= AREA : PROGRAMMING CATEGORY: MANIPULATION DOCUMENT#: 665 ======================================================================= One function useful in almost every database is the ability to search on a text column, for example, lastname. R:BASE provides a number of different methods to do this. In this edition of the Exchange, an article by Royal Harris, "Putting IHASH To Work", describes a method that worked well for him, but may not suit your database or application. How do you decide what method is best for you? There are a number of factors to consider, the two most important being ease of use and performance. Ease of Use ----------- Ease of use is very subjective. What appears easy to one person may appear incredibly convoluted and complex to another. This is an area best left to the individual developer's discretion and is not addressed in this article. The article touches on the amount of custom programming that may be required to implement a particular method, and adding any needed columns to the database. Performance ----------- Proper use of indexing is vital to performance. You'll want to review information on indexing from the Command Dictionary entry on Indexes, the Supercharging R:BASE booklet, and articles in the May/June 1992 and July/August 1991 issues of the Exchange. The single most important factor in determining the effectiveness of an index is the uniqueness of index values. When R:BASE builds an index, it builds an entry for each unique value. As it goes through the data, if it finds two rows of data with the same index value it builds a "multiple occurrence table" (MOT) for that index value. A unique index value is faster than a value with multiple index occurrences. You can have multiple occurrences of index values if you have more than one row with the same data value in a column, or if you have a computed column (such as IHASH) whose values share the same result. Note that we are talking about unique index values, not data values. An index value may or may not be the same as a data value. If your column datatype is INTEGER or TEXT 4, for example, your index value and data value will be the same. If your column is TEXT 16, then your data value must be converted (hashed) into a 4-byte index value. Long text columns that are indexed, then, have a higher probability of unique data values creating non-unique index values. RSCOPE can be used to check indexes and see how many duplicate (multiple ocurrence) versus unique entries there are, and how many times R:BASE must read the index file to find the value. An index loses its effectiveness as duplicate values increase and R:BASE must make more reads and comparisons. Indexed columns can also affect performance when adding or changing data in a table. Indexed columns must be updated when a row is added or when the index column value is changed. The number of indexed columns in a table affects the speed with which rows are added or changed in the table. Take this into account when adding special columns for searching. It is faster to load and change data on a table with one indexed column than on a table with seven indexed columns. A computed column must also be updated when a new row is added to the table or when the data on which it is based changes. This is important when deciding how many, if any, computed columns you want to add to your database for searching purposes. Incorporating indexed computed columns for searching slows the process of adding or changing data in the table. Methods of searching -------------------- A number of different searching methods are described below with discussion of the pros and cons of each method. This should help you pick a text searching method or combination of methods that fits your needs. All the examples use a table named EMPLOYEE and a column named LASTNAME. Using the text column only -------------------------- This is the most straightforward method. Simply index the LASTNAME column and use the where clause, WHERE LASTNAME = value, for searching. The equals operator (=) will use indexes. Using other operators on a TEXT column with a defined length greater than 4 will not use indexes. The advantage of this method is it's simplicity: no columns to add to the database, no extra programming, and users can easily do ad hoc queries through the R:BASE main menu. On a LASTNAME column, you can have a fairly effective index even though the data values must be hashed to a 4-byte index value. You'll have some duplicate index values but names are fairly well distributed throughout the alphabet. On a sample 42,000 row table, performance was comparable to all other methods when searching for a unique name or a common name (such as Johnson) that occurs many times. You will, of course, need to test this on your data. What this method doesn't provide is options. You need to enter the complete lastname and you need to enter it exactly; if you leave off a letter or misspell it, nothing is found. If you don't find the person did you misspell the name or are they really not there? Also, if you are searching on a text column other than LASTNAME (for example, PART_NUMBER) you may find that just indexing the column is not efficient and you'll want to use an alternative method. Computed column using IHASH --------------------------- R:BASE includes a function that can be used to create an integer value from a text value. The function was designed to create effective integer keys from long text columns. The function, IHASH, converts the entire text value, or just a specified number of characters. Using this method is more complex that just indexing the LASTNAME column. First you need to add a computed column to your table using the IHASH function on the LASTNAME column to convert its text to integer values. You can modify your table through the Create/modify option from the Info pulldown on the R:BASE main menu or use the ALTER TABLE command: ALTER TABLE employee ADD Hash_Lname=(IHASH(lastname,0)) INTEGER The IHASH function converts the entire name to integer when used with the parameter 0. A different parameter converts the specified number of characters from the name, starting at the first character. For example, the parameter 7 will convert the first 7 characters of the lastname to an integer value. Deciding on the number of characters to convert can be one of the hardest things about using this method. Consider the relationships expressed in the following chart: | Convert FEW characters | Convert MORE characters -------+---------------------------+-------------------------- PROS | less input required | less duplicate values -------+---------------------------+-------------------------- CONS | greater duplicate values | more input required -------+---------------------------+-------------------------- After adding the computed column to your table, you need to use some programming commands as shown below to query that column. Using the IHASH function directly in a WHERE clause won't use indexes. First set a variable equal to IHASH of the value you're searching for, then use the variable in the WHERE clause. When using an IHASH column for searching, you won't be able to do ad hoc queries from the R:BASE main menu . SET VAR vname = (IHASH('Smith',0)) SEL * FROM employee WHERE Hash_Lname = .vname This method does provide greater flexibility in that you can have users enter anywhere from 1 character to the entire name based on the number of characters you specify in the IHASH function. For example, add a computed column to the table that will IHASH the first four characters of the name. Then, in your program, check the length that the user enters and if it's greater than four characters use an extra condition on your WHERE clause. FILLIN vname USING 'Enter lastname (at least 4 characters): ' SET VAR vlen1=(SLEN(.vname)),vname1=(SGET(.VNAME,4,1)),+ vhash=(IHASH(.vname1,4)) IF vlen1 > 4 THEN CHOOSE vchoice FROM #VALUES FOR (firstname & lastname) + FROM employee WHERE Hash_Lname=.vhash AND + (SGET(lastname,.vlen1,1))=.vname ELSE CHOOSE vchoice FROM #VALUES FOR (firstname & lastname) + FROM employee WHERE Hash_Lname=.vhash ENDIF A user can enter any number of letters for use with an IHASH computation, but must enter at least as many characters as specified in the IHASH column definition or enter the full name. If the entry less than the specified number of characters and less than the full length of the name, the correct data is not found. For example, with a column defined as (IHASH(lastname,7)), entering "WILL" will not find "WILLIAMS", it will only find "WILL". The advantages of using a computed column with the IHASH function are that you can turn an inefficient TEXT index into an efficient INTEGER index and you can provide flexibility in searching. Users will have a larger selection of names to choose from and can select the appropriate person from the list. For example, entering WILLIAM will find WILLIAMSON, WILLIAM, and WILLIAMS if you use IHASH(lastname,7). A disadvantage of this method is that you need to add columns to your database. An extra computed column can slow down data entry. If you are tight on disk space this may not be an option. To determine how much additional disk space you'll need for an IHASH column, take the number of rows in the table and multiply by 4. The answer is the number of bytes of disk space you'll need for the additional column. Use the SOUNDSLIKE operator --------------------------- R:BASE has an undocumented WHERE clause operator,SOUNDSLIKE, that can by used to search text columns. It uses the Soundex algorithm, the same one that is used by the U.S. Census Bureau to store census information. Like its name implies, the SOUNDSLIKE operator uses the Soundex algorithm to find similar sounding names. The Soundex algorithm keeps the first character of the name and then converts the next 3 consonants to numbers using the following chart. Vowels are ignored, consonants W, Y and H are ignored, double consonants are treated as one, and side-by-side letters with the same number count as one. Empty spaces at the end of the code are filled with zeros. the number represents the letters ---------- ---------------------- 1 B P F V 2 C S K G J Q X Z 3 D T 4 L 5 M N 6 R For example, the name GRAHAM converts to the code G650. "G" for the first letter, 6 for the "R", the "AHA" is ignored, 5 for the "M" and then 0 to fill the code out to 4 characters. WILLIAM will convert to the code W450, WILLIAMS to the code W452. You can use the SOUNDSLIKE operator directly in a WHERE clause: BROWSE * FROM employee WHERE lastname SOUNDSLIKE 'Graham' In addition to rows where the lastname equals Graham, this condition would also return rows for lastnames Grahm, Grim, Grimm, Green, Greene, Groome, Gorin, Gahring and many others. But SOUNDSLIKE "William" will not return "Williams" because the code is different. The SOUNDSLIKE operator provides a different way of searching. You don't need to spell the name exactly and it finds many similar sounding names. But, as it doesn't use indexes, on a large table it can be slow. Computed column using SOUNDEX ----------------------------- You don't need to use the SOUNDSLIKE operator to take advantage of the Soundex code. With R:BASE 3.1C or higher you can use a UDF that calculates and returns the Soundex code. Microrim includes a Soundex UDF with UDF pack #3, available from our Sales Department at 800-628-6990. To use the UDF, create a computed column with the expression (UDF('SOUNDEX',lastname)) and datatype TEXT 4. Index the column. As with the IHASH function, the UDF function can't be used directly in a WHERE clause. You need to first convert the value then use the converted value in the WHERE clause so it will use indexes. For example, FILLIN vname1 USING 'Enter lastname: ' SET VAR vname2=(UDF('SOUNDEX',.vname)) CHOOSE vchoice FROM #VALUES FOR (firstname & lastname) + FROM employee WHERE scode=.vname2 As with the SOUNDSLIKE operator, exact spelling is not required, but you may get a large number of names to choose from. You can restrict the number of names by also matching on part of the lastname entered. For example, CHOOSE vchoice FROM #VALUES FOR (firstname & lastname) + FROM employee WHERE scode=.vname2 AND + (SGET(lastname,4,1))=(SGET(.vname1,4,1)) This WHERE clause requires that the Soundex code match AND the first four characters of the lastname match. This searching method's drawback is having to add a computed column using the UDF function. Slower processing results because R:BASE must ZIP out to execute the UDF program for each row. An advantage is that you don't need to worry about spelling names exactly. Users can choose from a list of similar sounding names. Computed column using SGET -------------------------- Another way to use a computed column to create an efficient index from a long text field is to create a TEXT 4 column that can be indexed and used for searching. For example, it can be a computed column defined as (SGET(lastname,4,1)). This takes the first four characters of the lastname and places them into another column. Index this column and use it for searching instead of the LASTNAME column. A TEXT 4 column does not need to be hashed when it is indexed so the index value equals the data value. However, this is not necessarily a very unique index, particularly on names. What this method does provide is the ability to use the BETWEEN operator on a text search. Both IHASH and SOUNDEX convert the name to an indexed value that can be used with the BETWEEN operator. Because they use a conversion process, however, names might convert to the same code but are not close together in the alphabet, for example, Graham and Greene. The SGET takes the first four characters of the lastname column and does not convert it. You can then use it with the BETWEEN operator to find a range of names. The command CHOOSE vchoice FROM #VALUES FOR (firstname & lastname) + FROM employee WHERE first_four BETWEEN 'John' and 'John' will quickly find all names that begin with 'John'. Wayne's Corner in this issue of the Exchange shows another example of using BETWEEN to search for a range of values. Computed column using an expression ----------------------------------- You may want to define your own expression to create an identifier for people. You could use the first 4 characters of the lastname plus the firstname initial, (IHASH(lastname,4)+IHASH(firstname,1)), or a lastname and zipcode combination, (lastname+zipcode) or (IHASH(lastname,0)+INT(zipcode)). There are many ways to combine/convert names into codes to use for searching. Just make sure you are consistent. If you use the zip code, always use just the 5-digit zip code. Don't mix and match 5-digit and 10-digit zip codes. You will want to look at the points raised with the other methods discussed here, and decide if it will be an effective indexed column for performance and will it be easy for your users. Compare the methods ------------------- By setting up a sample database with different columns and using different searching methods you can compare the methods. Be sure your sample database has enough data to provide relevant results. For example, a table with 100 or 1000 rows, will not produce comparable results if your production table has 100,000 rows of data. You will probably find that no one method is best for all names. For example, the Soundex code works fast for Gillespie but not for Graham; it's over twice as slow because many more names are found. Your data and the way you use it will determine the method(s) you choose. Perhaps you'll decide to provide users with two or more different searching methods. The following sample program allows users to search using the lastname (indexed LASTNAME column), the company name (indexed computed column, (IHASH(company,10)) ), or the Soundex code of the lastname (indexed computed column, (UDF('SOUNDEX",lastname)) ). The program can be used as an Entry/Exit procedure in a form. It requires a separate ASCII menu file for the search options. *(SEARCH.RMD) CLS CHOOSE vc FROM choice.mnu AT CENTER,CENTER CLEAR SWITCH (.vc) CASE 'Lastname' CLE VAR vans,vkey,vans1 DIALOG 'Enter lastname to search by ' vans,vkey,1 AT 12 IF vkey = '[Esc]' THEN BREAK ENDIF CHOOSE vname FROM #VALUES FOR (lastname+','&firsname ) FROM newcust + WHERE lastname=.vans ORDER BY lastname,firsname AT CENTER,CENTER + FOOTING BREAK CASE 'Company' CLE VAR vans,vkey,vans1 DIALOG 'Enter company to search for ' vans,vkey,1 AT 12 IF vkey = '[Esc]' THEN BREAK ENDIF SET VAR vans1 = (IHASH(.vans,7)) CHOOSE vname FROM #VALUES FOR (lastname+','&firsname ) FROM newcust + WHERE company_hash = .vans1 ORDER BY lastname,firsname AT CENTER,+ CENTER FOOTING BREAK CASE 'Soundex code' CLE VAR vans,vkey,vans1 DIALOG 'Enter lastname to search by using Soundex' vans,vkey,1 AT 12 IF vkey = '[Esc]' THEN BREAK ENDIF SET VAR vans1 = (UDF('SOUNDEX',.vans)) CHOOSE vname FROM #VALUES FOR (lastname+','&firsname ) FROM newcust + WHERE scode=.vans1 ORDER BY lastname,firsname AT CENTER,CENTER + FOOTING BREAK CASE default CLE VAR vans,vkey,vans1 BREAK ENDSW RETURN ASCII menu file: choice popup |Choose searching method| |Lastname| |Company | |Soundex code|