Handling Duplicate Data You print the month-end report and see two records for account #102. You've merged databases and need to check for duplicates. You have both John Smith and John C. Smith at the same address. Are they the same person? You have a record for John Smith with a 5-digit zipcode and a record for John Smith with a 10-digit zipcode. All of these situations involve what may be duplicate records. What can you do to find the records involved and delete the duplicates? Browsing the data is one way, but not if you have 10,000 records, and some situations require human interaction. For example, the computer cannot make the decision that John Smith and John C. Smith are the same person. The computer can, however, help you find the records involved and present them so that the data is easy to view and the necessary decisions can be made. The R:BASE DELETE DUPLICATES command deletes identical records from a table. The command searches and compares records character by character, and if two records are identical, will delete the second record. Because the command compares character by character, the command is not fast. And often, records are not exactly identical. Two spaces between words instead of one means records are not identical. An M instead of an N means records are not identical. Most situations involve data where human interaction is needed to say "Yes, these records are the same" or where the comparison needs to be made on one or two or three columns, not the entire row. In R:BASE 4.5 Plus! and higher, you can specify a column list with the DELETE DUPLICATES command to check for duplication on the specified columns only. For other situations, you need to find, and actually look at, the duplicated rows. Preventing Duplicates Use R:BASE constraints and data entry rules to prevent duplicate data from being entered. A primary key constraint requires that the specified column or columns be unique. You can also create a rule that requires data entered in a column to be unique. The primary key constraint or the rule is easily created through the menus. The menus prompt for the table and column name that must be unique. Another option is to define a unique index for a column. Each of these techniques ensures that duplicate data is not entered into a column. To require uniqueness across a group of columns, you define a multi-column primary key, a multi-column uniqueness rule, or a multi-column unique index. The multi-column primary key can be defined through the menu system in either R:BASE 4.5++ or R:BASE 5.1. The multi-column unique index can be defined through the menu system in R:BASE 5.1, but must be defined from the R> prompt in 4.5++. A multi-column uniqueness rule is defined through the custom rule option. The easiest way to define a multi-column uniqueness rule is to start with the default single column uniqueness rule. Create a single column uniqueness rule picking the appropriate table and one of the columns that is part of the unique group. Then select to modify the rule just created. Add to the WHERE clause comparison conditions for the other columns that make up the unique group. Follow the same structure as for the default rule. For example, here is the WHERE clause from a default rule requiring the employee last name to be unique: Employee.emplname IS NOT NULL AND Employee.emplname + NOT IN (SELECT emplname FROM Employee #T1 + WHERE #T1.emplname = Employee.emplname) Since last name is not unique be itself, the rule is modified to add the first name and address columns to the condition list. The three columns together, last name, first name, and address, make a unique set. Employee.emplname IS NOT NULL AND Employee.emplname + NOT IN (SELECT emplname FROM Employee #T1 + WHERE #T1.emplname = Employee.emplname + AND #T1.empfname = Employee.empfname AND + #T1.empaddress = Employee.empaddress) Finding Duplicates Ever wonder why a company sends you two or three pieces of the same mailing? All with slightly different names? In this age of computers why can't they figure out that you're all the same person? Well, the computer can't make the decision that Smith and Smith are the same person, only a human can do that. What the computer can do is help identify possible duplicate records. R:BASE's constraints and data entry rules can keep duplicate data from being entered, but constraints and rules don't prevent misspellings or miskeying of data. An SQL SELECT command is used to search data for possible duplicates. Notice that the name columns are not used in the SELECT command to check for duplicates. We want to search for possibly misspelled names so we don't use name as one of our criteria. SELECT empid, COUNT(*)....the column, empid, uniquely identifies a row FROM employee...... this is the table, Employee, to check for duplicates WHERE city IN............one of the columns used in the duplicate comparison (SELECT city...........a Sub-SELECT is used to create a comparison list FROM employee T1 GROUP BY state, city, address...group the data together by the columns used to check for duplicates HAVING COUNT(*) > 1)....... ...count the rows in the group, only return a result if the same state, city, and address occur more than once GROUP BY empid...........display the id and count only once for each id This SELECT command displays the employee ID and the number of times it's found with the same state, city and street address. If any records are found, there may be duplicate data where the name or the zipcode was misspelled or miskeyed. Now those records can be displayed for editing and someone can review them and decide if they are indeed duplicates and which one to keep. Keep the same WHERE clause, but replace the SELECT command with a command such as EDIT that allows viewing and deleting of the data: EDIT * FROM employee + WHERE city IN + (SELECT city + FROM employee T1 + GROUP BY state, city, address + HAVING COUNT(*) > 1) + ORDER BY state, city, address, lastname Add the ORDER BY clause so when the data is displayed, the possible duplicate records are next to each other. Depending on the size of the database, this can be a large task. But if you're serious about finding and removing duplicates, it needs to be done. Deleting duplicates Once the potential duplicates are found, a person can review the data and then delete the duplicate data. This way there is a choice over which record to keep. The computer can be programmed to do this task, but it will always keep the first record found. The easiest way to have the computer delete duplicates, keeping the first record, is to use the DELETE DUPLICATES command with the optional USING clause. The USING clause specifies the columns to check for uniqueness so the computer does not need to check the entire row. For example, the check for duplicate employee records you might use this command: DELETE DUPLICATES FROM employee + USING ELastName, EFirstName, EAddress R:BASE checks only the specified columns, ELastName, EFirstName, EAddress, for duplicate data making for much faster processing time than if R:BASE had to check the entire row. With the optional USING clause, the DELETE DUPLICATES command is the fastest way to delete duplicate rows of data from a table. This option is available in R:BASE version 4.5 Plus! and higher. An alternative method involves creating a temporary table and using a multi-column unique index or a multi-column uniqueness rule. This method works well for versions of R:BASE prior to R:BASE 4.5 Plus!. 1. Create a temporary table with the same structure as the table that has the duplicate records. PROJECT temp FROM employee USING ALL WHERE LIMIT=0 2. Create a multi-column uniqueness rule referencing the temporary table, Temp. Create an index on one of the columns to be compared, for example, lastname. This is important for performance. Or, create a multi-column unique index on the temporary table, Temp. When using a multi-column unique index, the columns used in the index must be specifically defined as NOT NULL. 3. Use the INSERT command to move the data from the original table to the temporary table. INSERT INTO temp SELECT * FROM employee As a row is added, the rule or index is checked. If a row already has been added, the rule or index check fails and the duplicate row is not loaded. There is no way to count how many rows fail or to see what those rows are. The first row is loaded and subsequent rows are automatically thrown out. The table Temp then contains only one row with a particular last name, first name, street address combination, for example. Not really duplicates Often duplicate data printed in a report is not really duplicated. The report may be based on a view and the view joins the tables in such a way that the duplicate data is created in the view but does not exist in the underlying tables. Duplicate data can appear in a view any time tables that have a many-to-many relationship are joined. Some of the columns are always duplicated when tables with a one-to- many relationship are joined. Look at an example from the CONCOMP sample database. The Salesbonus table has many rows for each employee. The column empid is used to link with other tables, but does not uniquely identify a row. empid transdate netamount bonuspct bonus ------ --------- ----------- -------- --------------- 102 01/12/95 $176000.00 0.003 $528.00 102 02/27/95 $87500.00 0.002 $175.00 102 02/28/95 $22500.00 0. $0.00 102 03/01/95 $40500.00 0.001 $40.50 129 01/10/95 $76800.00 0.002 $153.60 129 02/23/95 $36625.00 0.001 $36.63 129 03/07/95 $56250.00 0.002 $112.50 133 01/03/95 $27000.00 0.001 $27.00 160 01/09/95 $9500.00 0. $0.00 160 02/23/95 $210625.00 0.003 $631.88 The Employee table has one row for each employee. The empid column does uniquely identify a row. empid emptitle empfname emplname empaddr ----- -------------- -------- ---------- ------------------ 102 Manager June Wilson 3278 Summit Drive 129 Manager Ernest Hernandez 12390 Windermere Dr. 133 Representative Peter Coffin 4105 29th Ave N.E. 160 Representative Mary Simpson 101 West Mercer 165 Representative Darnell Williams 8806 88th Street 166 Sales Clerk John Chou 5001 Main Street 167 Representative Sandi Watson 1002 S. Front Ave. When the two tables are joined (Employee.empid=Salesbonus.empid), the result displays as many rows for each employee (empid) as there are in the Salesbonus table. The data selected from the Employee table is repeated, but the data from the Salesbonus table is unique. data from data from Employee Salesbonus /\ / \ empid emptitle empid transdate netamount ----- -------------- ----- --------- ----------- 102 Manager 102 01/12/95 $176000.00 102 Manager 102 02/27/95 $87500.00 102 Manager 102 02/28/95 $22500.00 102 Manager 129 01/10/95 $76800.00 129 Manager 129 02/23/95 $36625.00 129 Manager 129 03/07/95 $56250.00 133 Representative 133 01/03/95 $27000.00 160 Representative 160 01/09/95 $9500.00 160 Representative 160 02/23/95 $210625.00 | | | | +--------------+------+ | the data is linked together by the empid column One row is created in the view for each row in the many table (Salesbonus) that matches a row in the one table (Employee). What happens if this view is now linked with the Transmaster table? The Transmaster table also includes the column empid, but like the Salesbonus table, can have many rows for each employee. transid custid empid transdate netamount freight ------- ------- ------ --------- ----------- --------- 4760 100 133 01/03/95 $27000.00 $270.00 4780 105 160 09/09/95 $9500.00 $95.00 4790 104 129 01/10/95 $63000.00 $630.00 4795 101 102 01/12/95 $176000.00 $1760.00 4800 105 160 09/23/95 $167250.00 $1672.50 4865 102 129 02/23/95 $29125.00 $291.25 4975 101 102 02/27/95 $87500.00 $875.00 4980 101 102 09/28/95 $22500.00 $225.00 5000 101 102 03/01/95 $29000.00 $290.00 5050 104 129 09/07/95 $56250.00 $562.50 5060 101 102 03/08/95 $30000.00 $300.00 5065 106 160 09/14/95 $140300.00 $1403.00 5070 104 129 03/15/95 $95500.00 $955.00 5075 102 129 09/16/95 $155500.00 $1555.00 5080 100 133 03/20/95 $80000.00 $800.00 When the Employee, Salesbonus and Transmaster tables are joined by linking the empid column, the view contains many extra rows of data. Look at the data for employee #102 (empid=102), for example. There are four rows for employee #102 in the Employee, Salesbonus view. Each of these four rows joins with each of the five rows in the Transmaster table for employee #102, making 20 rows in the resulting three-table view. data from data from data from Employee Salesbonus Transmaster / \ / \ /\ empid emptitle transdate netamount transid custid empid transdate ----- -------- --------- ---------- ------- ------ ----- --------- 102 Manager 01/12/95 $176000.00 4795 101 102 01/12/95 102 Manager 02/27/95 $87500.00 4795 101 102 01/12/95 102 Manager 02/28/95 $22500.00 4795 101 102 01/12/95 102 Manager 03/01/95 $40500.00 4795 101 102 01/12/95 102 Manager 01/12/95 $176000.00 4975 101 102 02/27/95 102 Manager 02/27/95 $87500.00 4975 101 102 02/27/95 102 Manager 02/28/95 $22500.00 4975 101 102 02/27/95 102 Manager 03/01/95 $40500.00 4975 101 102 02/27/95 102 Manager 01/12/95 $176000.00 4980 101 102 09/28/95 102 Manager 02/27/95 $87500.00 4980 101 102 09/28/95 102 Manager 02/28/95 $22500.00 4980 101 102 09/28/95 102 Manager 03/01/95 $40500.00 4980 101 102 09/28/95 This makes a lot of duplicated data in our final view. Only a few of the columns are unique to a row. If we don't select all the columns, we may think we have duplicate rows. By using another column, transdate, to link the Transmaster table to the Salesbonus table, the view changes to a one-many-one relationship instead of a one-many- many. If you suspect there are extra rows in your view due to a many-to- many link, check it by selecting the data from the view and then from the underlying tables. With the above example, use the following commands to count the rows for a particular empid. The counts show that there are indeed extra rows in our view. R>SELECT COUNT(*) FROM employee WHERE empid=102 COUNT (*) ---------- 1 R>SELECT COUNT(*) FROM salesbonus WHERE empid=102 COUNT (*) ---------- 4 R>SELECT COUNT(*) FROM transmaster WHERE empid=102 COUNT (*) ---------- 5 R>SELECT COUNT(*) FROM bonus_view WHERE empid=102 COUNT (*) ---------- 20 It's much easier to detect this type of problem by picking a specific data value to test. How do you solve this? Look at your table structure. There may be another column that can used to link the tables to create a one relationship rather than a many relationship. Identify the columns that uniquely identify a row in each table. In the example above, the empid column does not uniquely identify a row in the Salesbonus table, rather both the empid and transdate columns must be used to uniquely identify a row. The transdate column can be used with the empid column to link the Salesbonus table with the Transmaster table removing the many-to-many relationship from the three table view. As with all SQL commands, when creating views it's best to work with a small sample of data so you can easily verify the results.