"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" DELETING DUPLICATE ROWS BASED ON A SUBSET OF COLUMNS """""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" DATE : 9/87 NUMBER : 87-9-1 PRODUCT : R:BASE SYSTEM V VERSIONS : ALL CATEGORY : DELETE SUBCATEGORY : DUPES/MAINTENANCE """""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" NEED: """" I need to delete duplicate rows in my customer table (CUST) based on a combination of only three columns (IDNO, LASTNAME, and ZIP) rather than all 20 columns in the table. The DELETE DUPLICATE command looks at all the columns. This does not work for me because some of the other columns may not be duplicated. SOLUTION: """""""" Here are three methods you can use to delete duplicates based on a subset of columns. Choose the one that best fits your needs. The first method (Print Report Then Load with FileGateway) is the fastest. It also allows you to set up criteria, such as "keep the most recent row," to tell R:BASE which of the duplicates to keep. This method prints all the data into a file with an R:BASE report, so you must be able to fit each of your rows onto a 255-character, fixed- format line. The second method (Unload Then Load with Rules) also allows you to set up some criteria to determine which row to keep. This method takes longer than the first method because it requires that you add a computed column to the table with the EXPAND command, and it checks rules during the load process. However, there is no limit to the length of your row. If your rows will not fit into a 255-character, fixed-format line, and you need to keep only the latest row, use this second method. The third method (APPEND with RULES) does not require the creation of a file. Instead, it uses RULES to prevent duplicates from getting into the data during an APPEND. Use this third method only if it does not matter which duplicate is kept. Methods two and three use the EXPAND and REMOVE COLUMN commands; make sure you have enough disk space. METHOD ONE: PRINT REPORT THEN LOAD WITH FILEGATEWAY This is the fastest method. Use this method if your rows can be printed in 255 characters or less, you have 10 or less columns in the subset, and you have enough room on your disk to hold a file that contains all the nonduplicate rows in the table. This method uses a MICRORIM ONLINE October 1987 --------------------------- Page 1 report which breaks on all the columns in the subset as a filter so that only one row from each set of duplicates is printed into a file. Then, the original table is emptied and FileGateway is used to load the formatted duplicate-free file into the now-empty original table. A report can contain up to 10 breakpoints so you can use this method if your subset of columns is 10 or less. 1. Backup the database. You need to have a good backup on hand because this method has you delete all the rows in your table. If anything goes wrong, you want to be able to recover. 2. Define a one-line report based on your table. Choose Configure and set lines per page to zero. Enter the first column in your subset of columns as the first break column, the second column as the second break column, the third as the third break, and so on for all columns in the subset. Next, press [ESC] to return to the main Reports EXPRESS menu and choose Edit. Locate all the columns in the original table in order (left to right) on the top line of the report and mark the line as a break footer section for the last break. 
3. Use the following commands to print all the nonduplicate rows from
   the table into a file and then delete all the original rows:

      SET NULL -0-
      OUTPUT nondupe.dat
      PRINT rptname
      OUTPUT SCREEN
      *( Now delete all the rows )
      DELETE ROWS FROM orig WHERE LIMIT = 999999999
      *( Now, regain the space by using the RELOAD or PACK commands )
      *( Use FileGateway to load the fixed-format file into ORIG )

   If you want to ensure that the row that is printed is the latest
   row, add a SORTED BY clause to the PRINT command. For example, if
   you are storing the date in a DATE column named INDATE and you want
   to make sure that the row printed into the file is the latest row,
   sort the data so that the last row in every duplicate group is the
   latest row; the last row in each duplicate group is the one printed
   in the break footer. If you sort by INDATE, you also need to
   include all the break columns in the SORTED BY clause. You can have
   a maximum of 10 sort columns, so if you need to sort by a DATE
   column, the subset can contain at most nine columns. For example,
   if you have five breaks, revise the PRINT command to read as
   follows:

      PRINT rptname SORTED BY break1 break2 break3 break4 break5 indate

   where BREAK1, BREAK2, BREAK3, and so on are the names of the break
   columns.

4. Use FileGateway to load the fixed-format file (NONDUPE.DAT) into
   the now-empty original table. Set rules off during the load
   process; they are not needed.

METHOD TWO: UNLOAD THEN LOAD WITH RULES

The first four steps for the second and third methods are the same:
create a good backup of your database, add a computed column to the
original table that is the concatenation of all the columns in the
subset, make an empty table (EMPTY) that has all the columns from the
original table (ORIG) by using the PROJECT command with the WHERE
LIMIT = 0 clause, and set up a rule to prevent duplicate entries into
the EMPTY table. In both methods, the EMPTY table ends up loaded with
only the nonduplicates. The original table is then removed, and the
no-longer-empty table is renamed to the original name.

Use this second method if your rows are longer than 255 characters and
it matters which duplicate row is kept. This method unloads the data
into a file, making sure that the row you want to keep is the first
row in every duplicate group. Then it uses a rule that allows only the
first row of each group back in. For example, the method outlined
below keeps the most recent row, based on a DATE column (INDATE), by
unloading the data sorted in descending order and then loading it back
in with a rule.

1. Back up the database. During this process you will be modifying the
   database structure and removing an entire table from the database.
   If anything goes wrong, you want to be able to recover.

2. Use the EXPAND command to add a computed TEXT column (CHECKCOL) to
   the original table (ORIG) that is the result of concatenating all
   the columns in the subset. This computed column will be used to
   prevent duplicates from being loaded into the empty table you will
   create in step three. Be sure that you have enough available disk
   space for R:BASE to make a copy of your original table during the
   EXPAND process. After the EXPAND process is complete, you may want
   to use the RELOAD or PACK command to regain disk space.

   If some of the columns in the subset are not TEXT, use the CTXT
   function within the computed column expression to convert the
   non-TEXT items to TEXT for the concatenation; concatenation works
   only with TEXT. For example, if the columns in your subset are IDNO
   (INTEGER), LASTNAME (TEXT 20), and ZIP (TEXT 10), and the longest
   IDNO in your table is six characters, use the following EXPAND
   command to add the computed column (CHECKCOL):

      EXPAND orig WITH checkcol=((CTXT(idno))+lastname+zip) TEXT 36

   If IDNO had been TEXT 6 to begin with, the EXPAND command would
   have been:

      EXPAND orig WITH checkcol=(idno+lastname+zip) TEXT 36

   The number on the end is obtained by adding up the lengths of the
   subset columns (6 + 20 + 10). The illustration below shows the kind
   of values this computed column produces.
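   Purely as an illustration (the data values are invented, and any
   trailing blanks in the concatenated TEXT are omitted here for
   readability), three CUST rows might produce these CHECKCOL values:

      IDNO   LASTNAME   ZIP     CHECKCOL
      101    Jones      98052   101Jones98052   <- loaded into EMPTY
      101    Jones      98052   101Jones98052   <- same CHECKCOL; rejected by the rule in step 4
      101    Jonas      98052   101Jonas98052   <- different CHECKCOL; loaded into EMPTY

   The point is that rows that are duplicates on the subset produce
   identical CHECKCOL values, so the rule defined in step 4 lets only
   the first of them into EMPTY.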
3. Create an empty table (EMPTY) that has the same structure as the
   original table (ORIG), change the definition of CHECKCOL in EMPTY
   from a computed TEXT column to a normal TEXT column, and key it:

      PROJECT empty FROM orig USING ALL WHERE LIMIT = 0
      REDEFINE checkcol TO TEXT IN empty

4. Define a rule to prevent duplicate CHECKCOL values from being
   loaded into EMPTY with these commands:

      DEFINE RULES
      " " checkcol IN empty NEA checkcol IN empty
      END

5. Use the commands listed below to unload the data from ORIG (the
   original table) into a file in descending order by a DATE column
   (INDATE), load EMPTY with the file, remove ORIG, rename EMPTY to
   ORIG, remove the extra column (CHECKCOL), and pack the database.
   Note that this method unloads all the data to a disk file and uses
   the REMOVE COLUMN command; both of these processes use disk space.
   Make sure you have enough room on your disk.

      OUTPUT temp.dat
      UNLOAD DATA FOR orig AS ASCII SORTED BY indate=D
      OUTPUT SCREEN
      SET MESSAGES OFF
      SET ERROR MESSAGES OFF
      SET BELL OFF
      SET RULES ON
      LOAD empty FROM temp.dat AS ASCII
      SET MESSAGES ON
      SET ERROR MESSAGES ON
      SET BELL ON
      REMOVE TABLE orig
      RENAME TABLE empty TO orig
      REMOVE COLUMN checkcol FROM orig
      ERASE temp.dat
      PACK

   While loading the data, R:BASE checks the rules and adds only the
   first row of each set of duplicates to the EMPTY table.

METHOD THREE: APPEND WITH RULES

Use this method if it does not matter which of the duplicate rows is
kept. First, repeat steps one through four from Method Two; then
continue with this step five:

5. Use the following commands to append ORIG (the original table) to
   EMPTY, remove ORIG, rename EMPTY to ORIG, remove the extra column
   (CHECKCOL), and pack the database:

      SET MESSAGES OFF
      SET ERROR MESSAGES OFF
      SET BELL OFF
      SET RULES ON
      APPEND orig TO empty
      SET MESSAGES ON
      SET ERROR MESSAGES ON
      SET BELL ON
      REMOVE TABLE orig
      RENAME TABLE empty TO orig
      REMOVE COLUMN checkcol FROM orig
      PACK

   During the append process, R:BASE checks the rules and adds only
   the first row in each set of duplicates to the EMPTY table.
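For reference, here is Method Three condensed into a single command
sequence for the CUST table from the question. This is only a sketch:
it assumes the same subset and column types as the Method Two example
(IDNO is INTEGER with values no longer than six characters, LASTNAME
is TEXT 20, and ZIP is TEXT 10), so substitute your own table name,
column names, and computed-column length, and back up the database
first (step 1):

   *( Step 2: add the computed column )
   EXPAND cust WITH checkcol=((CTXT(idno))+lastname+zip) TEXT 36
   *( Step 3: make an empty copy of the structure and redefine CHECKCOL )
   PROJECT empty FROM cust USING ALL WHERE LIMIT = 0
   REDEFINE checkcol TO TEXT IN empty
   *( Step 4: rule to keep duplicate CHECKCOL values out of EMPTY )
   DEFINE RULES
   " " checkcol IN empty NEA checkcol IN empty
   END
   *( Step 5: append with rules on, then swap tables and clean up )
   SET MESSAGES OFF
   SET ERROR MESSAGES OFF
   SET BELL OFF
   SET RULES ON
   APPEND cust TO empty
   SET MESSAGES ON
   SET ERROR MESSAGES ON
   SET BELL ON
   REMOVE TABLE cust
   RENAME TABLE empty TO cust
   REMOVE COLUMN checkcol FROM cust
   PACK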