DOCUMENT #683 ======================================================================= SEEING DOUBLE ======================================================================= Product: R:BASE Version : 3.1 & Higher ======================================================================= Area : DATA MANIPULATION Category: DUPLICATES ======================================================================= It happens. You print the month-end report and see two records for account #102. You've merged databases and need to check for duplicates. You have both John Smith and John C. Smith at the same address. Are they the same person? You have a record for John Smith with a 5-digit zipcode and a record for John Smith with a 10-digit zipcode. All of these situations involve what may be duplicate records. What can you do to find the records involved and delete the duplicates? Browsing the data is one way, but not if you have 10,000 records. And some situations require human interaction. For example, the computer cannot make the decision that John Smith and John C. Smith are the same person. The computer can, however, help you find the records involved and present them so that it is easy to view and make the necessary decisions. R:BASE has a command, DELETE DUPLICATES, that will delete identical records from a table. It searches and compares records character by character, and if two records are identical, will delete one of them. Because it must compare character by character, it is not a fast command. And often, records are not identical. Two spaces between words instead of one means records are not identical. An M instead of an N means records are not identical. Most situations involve data where human interaction is needed to say "Yes, these records are the same" or where the comparison needs to be made on one or two or three columns, not the entire row. How can you get the computer to help you find possible duplicate data? What can you do to prevent this data being entered in the first place? Preventing Duplicates ===================== Use R:BASE data entry rules to prevent duplicate data from being entered. R:BASE 3.x and 4.0 have a default rule that requires data entered in a column to be unique. This rule is easily created through the menus by choosing Info Create/modify, then Rules and then Require a unique value. The menus prompt for the table and column name that must be unique. To require uniqueness across a group of columns means defining a custom rule. It's not that difficult. Examples of multi-column uniqueness rules are in the Command Dictionary section of the Reference Manual, under Rules. Select a table and column from the menu just like for a default rule. Then select SUCCEEDS. The basic structure for the WHERE clause for a SUCCEEDS rule is: WHERE column1 NOT IN (SELECT column1 FROM table T1 WHERE T1.column2=table.column2 AND t1.column3=table.column3 .... ) column1 is the column the rule is based on, the one you selected from the menu. It is one of the columns that makes up the uniqueness table is the table name the rule is based on, the table selected from the menu. column2, column3 etc. are the other columns that make up the uniqueness Note the use of a correlation name (T1) in the Sub-SELECT. This allows comparison of the values entered (table.column2, for example) with the values already in the table (T1.column2). The Sub-SELECT creates a list of all the column1 values from the table where column2 has the value entered and column3 has the value entered etc. Uniqueness rules don't allow use of the Add row and duplicate option in forms. Duplicating a row violates the rule. Let's look at an example. We want to make sure we don't enter two records for the same employee. But a single column uniqueness rule on the last name won't work; it's very possible to have 2 employees with the same last name. There can even be 2 employees with the same first and last name, living in the same city. It's unlikely, however, that they would have the same first name, the same last name and the same street address. These look like a good combination of columns for a rule that prevents the same employee being entered twice. A custom rule is required to check for uniqueness across multiple columns. Follow these steps to create the custom rule. 1. From the Info Create/modify menu, select Custom rule 2. Select the table name the rule will check data for; for this example, choose employee 3. Select the column name to verify, choose lastname 4. Select SUCCEEDS. What you select here affects how you write your custom rule. SUCCEEDS uses NOT IN or NOT EXISTS in the WHERE clause; FAILS uses IN or EXISTS. 5. Enter your WHERE clause: NOT EXISTS (SELECT lastname FROM employee T1 WHERE T1.lastname=employee.lastname AND T1.firstname=employee.firstname AND T1.address=employee.address) The SELECT statement creates a list of last names where the last name, first name and address that are already in the Employee table (T1) match the ones you're trying to enter (Employee). SUCCEEDS and NOT EXISTS are used because we don't want to find any records, i.e. we want an empty list, one that does not exist. 6. Enter the rule message. This is the message that displays when the data entered matches data already in the table. Finding Duplicates ================== Ever wonder why a company sends you two or three pieces of the same mailing? All with slightly different names? In this age of computers why can't they figure out that you're all the same person? Well, the computer can't make the decision that Smith and Smith are the same person, only a human can do that. What the computer can do is help identify possible duplicate records. Above you saw how R:BASE's data entry rules can keep duplicate data from being entered. But rules don't prevent misspellings or miskeying of data. An SQL SELECT command is used to search data for possible duplicates. Notice that the name is not used in the SELECT command to check for duplicates. We want to search for possibly misspelled names so we don't use name as one of our criteria. SELECT empid,COUNT(*)...........the column, empid, uniquely identifies a row FROM employee.................this is the table, Employee, to check for duplicates WHERE city IN.................one of the columns used in the duplicate comparison (SELECT city..................a Sub-SELECT is used to create a comparison list FROM employee T1 GROUP BY state,city,address...group the data together by the columns used to check for duplicates HAVING COUNT(*) > 1)..........count rows in the group, only return result if same state,city,address occur more than once GROUP BY empid................display the id and count only once for each id This select command displays the employee ID and the number of times it's found with the same state, city and street address. If any records are found, there may be duplicate data where the name or the zipcode was misspelled or miskeyed. Now those records can be displayed for editing and someone can review them and decide if they are indeed duplicates and which one to keep. Keep the same WHERE clause, but replace the SELECT command with a command such as EDIT that allows viewing and deleting of the data: EDIT * FROM employee + WHERE city IN + (SELECT city + FROM employee T1 + GROUP BY state,city,address + HAVING COUNT(*) > 1) + ORDER BY state,city,address,lastname Add the ORDER BY clause so when the data is displayed, the possible duplicate records are next to each other. Depending on the size of the database, this can be a large task. But if you're serious about finding and removing duplicates, it needs to be done. Deleting duplicates =================== As shown above, once the duplicates are found, a person can review the data and then delete the duplicate data. This way there is a choice over which record to keep. The computer can be programmed to do this task, but it will always keep the first record found. Here's an easy way to have the computer delete duplicates, keeping the first record. 1. Create a temporary table with the same structure as the table that has the duplicate records. PROJECT temp FROM employee USING ALL WHERE LIMIT=0 2. Create a multi-column uniqueness rule referencing the temporary table, Temp. Create an index on one of the columns to be compared, for example, lastname. This is important for performance. 3. Use the INSERT command to move the data from the original table to the temporary table. INSERT INTO temp SELECT * FROM employee As a row is added, the rule is checked. If a row already has been added, the rule will fail and the duplicate row will not be loaded. There is no way to count how many rows fail or to see what those rows are. The first row is loaded and subsequent rows are automatically thrown out. The table Temp then contains only one row with a particular lastname, firstname, street address combination, for example. For more information on Sub-SELECTs, multi-column uniqueness rules and using the SELECT command to find duplicates see articles "Making Sub-SELECTs, Outer Joins, & Rules Faster" (FAX server document #318) and "SQL Tips for Faster Applications" (FAX #314 and #321) from the July/August 1991 Exchange Technical Journal and "Protect Your Database with R:BASE Rules" (FAX #295) from the March/April 1991 Exchange. Not really duplicates ===================== Often duplicate data printed in a report is not really duplicated. The report may be based on a view and the view joins the tables in such a way that the duplicate data is created in the view but does not exist in the underlying tables. Duplicate data can appear in a view any time tables that have a many-many relationship are joined. Some of the columns are duplicated when tables with a one-many relationship are joined. Look at an example from the CONCOMP sample database. The Salesbonus table has many rows for each employee. The column empid is used to link with other tables, but does not uniquely identify a row. empid transdate netamount bonuspct bonus ------ --------- ----------- -------- --------------- 102 01/12/89 $176000.00 0.003 $528.00 102 02/27/89 $87500.00 0.002 $175.00 102 02/28/89 $22500.00 0. $0.00 102 03/01/89 $40500.00 0.001 $40.50 129 01/10/89 $76800.00 0.002 $153.60 129 02/23/89 $36625.00 0.001 $36.63 129 03/07/89 $56250.00 0.002 $112.50 133 01/03/89 $27000.00 0.001 $27.00 160 01/09/89 $9500.00 0. $0.00 160 02/23/89 $210625.00 0.003 $631.88 The Employee table has one row for each employee. The empid column does uniquely identify a row. empid emptitle empfname emplname empaddr ----- -------------- -------- ---------- ------------------ 102 Manager June Wilson 3278 Summit Drive 129 Manager Ernest Hernandez 12390 Windermere Dr. 133 Representative Peter Coffin 4105 29th Ave N.E. 160 Representative Mary Simpson 101 West Mercer 165 Representative Darnell Williams 8806 88th Street 166 Sales Clerk John Chou 5001 Main Street 167 Representative Sandi Watson 1002 S. Front Ave. When the two tables are joined (Employee.empid=Salesbonus.empid), the result displays as many rows for each employee (empid) as there are in the Salesbonus table. The data selected from the Employee table is repeated, but the data from the Salesbonus table is unique. data from data from Employee Salesbonus / \ / \ empid emptitle empid transdate netamount ----- -------------- ----- --------- ----------- 102 Manager 102 01/12/89 $176000.00 102 Manager 102 02/27/89 $87500.00 102 Manager 102 02/28/89 $22500.00 102 Manager 102 03/01/89 $40500.00 129 Manager 129 01/10/89 $76800.00 129 Manager 129 02/23/89 $36625.00 129 Manager 129 03/07/89 $56250.00 133 Representative 133 01/03/89 $27000.00 160 Representative 160 01/09/89 $9500.00 160 Representative 160 02/23/89 $210625.00 | | | | +--------------+----------------+ | the data is linked together by the empid column One row is created in the view for each row in the many table (Salesbonus) that matches a row in the one table (Employee). What happens if this view is now linked with the Transmaster table? The Transmaster table also includes the column empid, but like the Salesbonus table, can have many rows for each employee. transid custid empid transdate netamount freight ------- ------- ------ --------- ----------- --------- 4760 100 133 01/03/89 $27000.00 $270.00 4780 105 160 09/09/89 $9500.00 $95.00 4790 104 129 01/10/89 $63000.00 $630.00 4795 101 102 01/12/89 $176000.00 $1760.00 4800 105 160 09/23/89 $167250.00 $1672.50 4865 102 129 02/23/89 $29125.00 $291.25 4975 101 102 02/27/89 $87500.00 $875.00 4980 101 102 09/28/89 $22500.00 $225.00 5000 101 102 03/01/89 $29000.00 $290.00 5050 104 129 09/07/89 $56250.00 $562.50 5060 101 102 03/08/89 $30000.00 $300.00 5065 106 160 09/14/89 $140300.00 $1403.00 5070 104 129 03/15/89 $95500.00 $955.00 5075 102 129 09/16/89 $155500.00 $1555.00 5080 100 133 03/20/89 $80000.00 $800.00 When the Employee, Salesbonus and Transmaster tables are joined by linking the empid column, the view contains many extra rows of data. Look at the data for employee #102 (empid=102), for example. There are four rows for employee #102 in the Employee, Salesbonus view. Each of these four rows joins with each of the five rows in the Transmaster table for employee #102, making 20 rows in the resulting three-table view. data from data from data from Employee Salesbonus Transmaster / \ / \ / \ empid emptitle transdate netamount transid custid empid transdate ----- -------- --------- ---------- ------- ------ ----- --------- 102 Manager 01/12/89 $176000.00 4795 101 102 01/12/89 102 Manager 02/27/89 $87500.00 4795 101 102 01/12/89 102 Manager 02/28/89 $22500.00 4795 101 102 01/12/89 102 Manager 03/01/89 $40500.00 4795 101 102 01/12/89 102 Manager 01/12/89 $176000.00 4975 101 102 02/27/89 102 Manager 02/27/89 $87500.00 4975 101 102 02/27/89 102 Manager 02/28/89 $22500.00 4975 101 102 02/27/89 102 Manager 03/01/89 $40500.00 4975 101 102 02/27/89 102 Manager 01/12/89 $176000.00 4980 101 102 09/28/89 102 Manager 02/27/89 $87500.00 4980 101 102 09/28/89 102 Manager 02/28/89 $22500.00 4980 101 102 09/28/89 102 Manager 03/01/89 $40500.00 4980 101 102 09/28/89 This makes a lot of duplicated data in our final view. Only a few of the columns are unique to a row. If we don't select all the columns, we many even think we have duplicate rows. By using another column, transdate, to link the Transmaster table to the Salesbonus table, the view changes to a one-many-one relationship instead of a one-many-many. If you suspect there are extra rows in your view due to a many-many link, check it by selecting the data from the view and then from the underlying tables. With the above example, use the following commands to count the rows for a particular empid. The counts show that there are indeed extra rows in our view. R>SELECT COUNT(*) FROM employee WHERE empid=102 COUNT (*) ---------- 1 R>SELECT COUNT(*) FROM salesbonus WHERE empid=102 COUNT (*) ---------- 4 R>SELECT COUNT(*) FROM transmaster WHERE empid=102 COUNT (*) ---------- 5 R>SELECT COUNT(*) FROM bonus_view WHERE empid=102 COUNT (*) ---------- 20 It's much easier to detect this type of problem by picking a specific data value to test. How do you solve this? Look at your table structure, there may be another column that can used to link the tables to create a one relationship rather than a many relationship. Identify the columns that uniquely identify a row in each table. In the example above, the transdate column can be used to link the Sales- bonus table with the Transmaster table. As with all SQL commands, when creating views it's best to work with a small sample of data so you can easily verify the results.