DOCUMENT #683
     =======================================================================
     SEEING DOUBLE
     =======================================================================
     Product:  R:BASE                   Version :  3.1 & Higher
     =======================================================================
     Area   :  DATA MANIPULATION        Category:  DUPLICATES                 
     =======================================================================
 
 
     It happens.  You print the month-end report and see two records 
     for account #102.  You've merged databases and need to check for 
     duplicates.  You have both John Smith and John C. Smith at the 
     same address.  Are they the same person?  You have a record for 
     John Smith with a 5-digit zipcode and a record for John Smith with 
     a 10-digit zipcode.  All of these situations involve what may be 
     duplicate records.  What can you do to find the records involved and 
     delete the duplicates?  
 
     Browsing the data is one way, but not if you have 10,000 records. 
     And some situations require human interaction.  For example, the 
     computer cannot make the decision that John Smith and John C. Smith 
     are the same person.  The computer can, however, help you find the 
     records involved and present them so that it is easy to view and make 
     the necessary decisions.
 
     R:BASE has a command, DELETE DUPLICATES, that will delete identical 
     records from a table.  It searches and compares records character by 
     character, and if two records are identical, will delete one of them.  
     Because it must compare character by character, it is not a fast 
     command.  And often, records are not identical.  Two spaces between 
     words instead of one means records are not identical.  An M instead 
     of an N means records are not identical.  Most situations involve data 
     where human interaction is needed to say "Yes, these records are the 
     same" or where the comparison needs to be made on one or two or three 
     columns, not the entire row.
 
     How can you get the computer to help you find possible duplicate data?  
     What can you do to prevent this data being entered in the first place?
 
 
     Preventing Duplicates
     =====================
     Use R:BASE data entry rules to prevent duplicate data from being 
     entered.  R:BASE 3.x and 4.0 have a default rule that requires data 
     entered in a column to be unique.  This rule is easily created through 
     the menus by choosing Info Create/modify, then Rules and then Require 
     a unique value.  The menus prompt for the table and column name that 
     must be unique.
 
     To require uniqueness across a group of columns means defining a custom 
     rule.  It's not that difficult.  Examples of multi-column uniqueness 
     rules are in the Command Dictionary section of the Reference Manual, 
     under Rules.  Select a table and column from the menu just like for a 
     default rule.  Then select SUCCEEDS.  The basic structure for the WHERE 
     clause for a SUCCEEDS rule is:
 
       WHERE column1 NOT IN (SELECT column1 FROM table T1 WHERE
                 T1.column2=table.column2 AND t1.column3=table.column3 .... )
 
     column1 is the column the rule is based on, the one you selected from 
     the menu.  It is one of the columns that makes up the uniqueness 
     
     table is the table name the rule is based on, the table selected from 
     the menu.  
 
     column2, column3 etc. are the other columns that make up the uniqueness
     
     Note the use of a correlation name (T1) in the Sub-SELECT.  This allows 
     comparison of the values entered (table.column2, for example) with the 
     values already in the table (T1.column2).  The Sub-SELECT creates a 
     list of all the column1 values from the table where column2 has the 
     value entered and column3 has the value entered etc.   
 
     Uniqueness rules don't allow use of the Add row and duplicate option 
     in forms.  Duplicating a row violates the rule.
 
     Let's look at an example.  We want to make sure we don't enter two 
     records for the same employee.  But a single column uniqueness rule 
     on the last name won't work; it's very possible to have 2 employees 
     with the same last name.  There can even be 2 employees with the same 
     first and last name, living in the same city.  It's unlikely, however, 
     that they would have the same first name, the same last name and the 
     same street address.  These look like a good combination of columns 
     for a rule that prevents the same employee being entered twice.  
 
     A custom rule is required to check for uniqueness across multiple 
     columns.  Follow these steps to create the custom rule. 
 
     1. From the Info Create/modify menu, select Custom rule
 
     2. Select the table name the rule will check data for; for this
     example, choose employee  
 
     3. Select the column name to verify, choose lastname  
 
     4. Select SUCCEEDS.  What you select here affects how you write your 
     custom rule.  SUCCEEDS uses NOT IN or NOT EXISTS in the WHERE clause; 
     FAILS uses IN or EXISTS. 
 
     5. Enter your WHERE clause: 
         NOT EXISTS (SELECT lastname FROM employee T1 WHERE
         T1.lastname=employee.lastname AND T1.firstname=employee.firstname 
         AND T1.address=employee.address)
 
     The SELECT statement creates a list of last names where the last name, 
     first name and address that are already in the Employee table (T1) 
     match the ones you're trying to enter (Employee).  SUCCEEDS and NOT 
     EXISTS are used because we don't want to find any records, i.e. we 
     want an empty list, one that does not exist.  
 
     6. Enter the rule message.  This is the message that displays when
     the data entered matches data already in the table.
 
 
     Finding Duplicates
     ==================
     Ever wonder why a company sends you two or three pieces of the same 
     mailing?  All with slightly different names?  In this age of computers 
     why can't they figure out that you're all the same person?  Well, the 
     computer can't make the decision that Smith and Smith are the same 
     person, only a human can do that.  What the computer can do is help 
     identify possible duplicate records.
 
     Above you saw how R:BASE's data entry rules can keep duplicate data 
     from being entered.  But rules don't prevent misspellings or miskeying 
     of data.  An SQL SELECT command is used to search data for possible 
     duplicates.  Notice that the name is not used in the SELECT command 
     to check for duplicates.  We want to search for possibly misspelled 
     names so we don't use name as one of our criteria.
 
     SELECT empid,COUNT(*)...........the column, empid, uniquely identifies 
                                     a row
 
     FROM employee.................this is the table, Employee, to check 
                                   for duplicates
 
     WHERE city IN.................one of the columns used in the 
                                   duplicate comparison 
                              
     (SELECT city..................a Sub-SELECT is used to create a
                                    comparison list
     FROM employee T1              
     
     GROUP BY state,city,address...group the data together by the columns
                                   used to check for duplicates
 
     HAVING COUNT(*) > 1)..........count rows in the group, only return 
                                   result if same state,city,address occur 
                                   more than once
 
     GROUP BY empid................display the id and count only once for 
                                   each id
 
     This select command displays the employee ID and the number of times 
     it's found with the same state, city and street address.  If any 
     records are found, there may be duplicate data where the name or the 
     zipcode was misspelled or miskeyed.  Now those records can be displayed 
     for editing and someone can review them and decide if they are indeed 
     duplicates and which one to keep.  
 
     Keep the same WHERE clause, but replace the SELECT command with a 
     command such as EDIT that allows viewing and deleting of the data:
 
      EDIT * FROM employee +
      WHERE city IN +
        (SELECT city +
         FROM employee T1 +
         GROUP BY state,city,address +
         HAVING COUNT(*) > 1) +
      ORDER BY state,city,address,lastname
 
     Add the ORDER BY clause so when the data is displayed, the possible 
     duplicate records are next to each other.  Depending on the size of 
     the database, this can be a large task.  But if you're serious about 
     finding and removing duplicates, it needs to be done.
 
 
     Deleting duplicates
     ===================
     As shown above, once the duplicates are found, a person can review the 
     data and then delete the duplicate data.  This way there is a choice 
     over which record to keep.  The computer can be programmed to do this 
     task, but it will always keep the first record found. Here's an easy 
     way to have the computer delete duplicates, keeping the first record.
 
     1. Create a temporary table with the same structure as the table that 
     has the duplicate records.
 
        PROJECT temp FROM employee USING ALL WHERE LIMIT=0
 
     2. Create a multi-column uniqueness rule referencing the temporary 
     table, Temp.  Create an index on one of the columns to be compared, 
     for example, lastname.  This is important for performance.
 
     3. Use the INSERT command to move the data from the original table to 
     the temporary table.
 
        INSERT INTO temp SELECT * FROM employee
 
     As a row is added, the rule is checked.  If a row already has been 
     added, the rule will fail and the duplicate row will not be loaded.  
     There is no way to count how many rows fail or to see what those rows 
     are.  The first row is loaded and subsequent rows are automatically 
     thrown out.  The table Temp then contains only one row with a 
     particular lastname, firstname, street address combination, for example.
 
     For more information on Sub-SELECTs, multi-column uniqueness rules 
     and using the SELECT command to find duplicates see articles "Making 
     Sub-SELECTs, Outer Joins, & Rules Faster" (FAX server document #318) 
     and "SQL Tips for Faster Applications" (FAX #314 and #321) from the 
     July/August 1991 Exchange Technical Journal and "Protect Your 
     Database with R:BASE Rules" (FAX #295) from the March/April 1991 
     Exchange.
     
     
     Not really duplicates
     =====================
     Often duplicate data printed in a report is not really duplicated.  
     The report may be based on a view and the view joins the tables in 
     such a way that the duplicate data is created in the view but does 
     not exist in the underlying tables.  Duplicate data can appear in a 
     view any time tables that have a many-many relationship are joined. 
     Some of the columns are duplicated when tables with a one-many 
     relationship are joined.
 
     Look at an example from the CONCOMP sample database.  The Salesbonus 
     table has many rows for each employee.  The column empid is used to 
     
     link with other tables, but does not uniquely identify a row.  
 
     empid   transdate  netamount    bonuspct  bonus
     ------  ---------  -----------  --------  ---------------
     102     01/12/89   $176000.00   0.003     $528.00
     102     02/27/89   $87500.00    0.002     $175.00
     102     02/28/89   $22500.00    0.        $0.00
     102     03/01/89   $40500.00    0.001     $40.50
     129     01/10/89   $76800.00    0.002     $153.60
     129     02/23/89   $36625.00    0.001     $36.63
     129     03/07/89   $56250.00    0.002     $112.50
     133     01/03/89   $27000.00    0.001     $27.00
     160     01/09/89   $9500.00     0.        $0.00
     160     02/23/89   $210625.00   0.003     $631.88
     
     The Employee table has one row for each employee.  The empid column 
     does uniquely identify a row.
 
      empid  emptitle        empfname  emplname     empaddr
      -----  --------------  --------  ----------   ------------------      
      102    Manager         June      Wilson       3278 Summit Drive     
      129    Manager         Ernest    Hernandez    12390 Windermere Dr.     
      133    Representative  Peter     Coffin       4105 29th Ave N.E.     
      160    Representative  Mary      Simpson      101 West Mercer     
      165    Representative  Darnell   Williams     8806 88th Street     
      166    Sales Clerk     John      Chou         5001 Main Street     
      167    Representative  Sandi     Watson       1002 S. Front Ave.   
 
     When the two tables are joined (Employee.empid=Salesbonus.empid), the 
     result displays as many rows for each employee (empid) as there are 
     in the Salesbonus table.  The data selected from the Employee table 
     is repeated, but the data from the Salesbonus table is unique.  
  
            data from                      data from
            Employee                       Salesbonus
          /          \                   /            \
       empid  emptitle            empid  transdate  netamount  
       -----  --------------      -----  ---------  -----------      
       102    Manager             102    01/12/89   $176000.00    
       102    Manager             102    02/27/89   $87500.00    
       102    Manager             102    02/28/89   $22500.00     
       102    Manager             102    03/01/89   $40500.00     
       129    Manager             129    01/10/89   $76800.00     
       129    Manager             129    02/23/89   $36625.00     
       129    Manager             129    03/07/89   $56250.00     
       133    Representative      133    01/03/89   $27000.00     
       160    Representative      160    01/09/89   $9500.00     
       160    Representative      160    02/23/89   $210625.00   
                 |                               |
                 |                               |
                 +--------------+----------------+
                                |
        the data is linked together by the empid column
 
     One row is created in the view for each row in the many table 
     (Salesbonus) that matches a row in the one table (Employee). 
 
     What happens if this view is now linked with the Transmaster table?  
     The Transmaster table also includes the column empid, but like the 
     Salesbonus table, can have many rows for each employee. 
 
      transid  custid   empid   transdate  netamount    freight
      -------  -------  ------  ---------  -----------  ---------
      4760     100      133     01/03/89   $27000.00    $270.00
      4780     105      160     09/09/89   $9500.00     $95.00
      4790     104      129     01/10/89   $63000.00    $630.00
      4795     101      102     01/12/89   $176000.00   $1760.00
      4800     105      160     09/23/89   $167250.00   $1672.50
      4865     102      129     02/23/89   $29125.00    $291.25
      4975     101      102     02/27/89   $87500.00    $875.00
      4980     101      102     09/28/89   $22500.00    $225.00
      5000     101      102     03/01/89   $29000.00    $290.00
      5050     104      129     09/07/89   $56250.00    $562.50
      5060     101      102     03/08/89   $30000.00    $300.00
      5065     106      160     09/14/89   $140300.00   $1403.00
      5070     104      129     03/15/89   $95500.00    $955.00
      5075     102      129     09/16/89   $155500.00   $1555.00
      5080     100      133     03/20/89   $80000.00    $800.00
     
     When the Employee, Salesbonus and Transmaster tables are joined by 
     linking the empid column, the view contains many extra rows of data.  
     Look at the data for employee #102 (empid=102), for example.  There 
     are four rows for employee #102 in the Employee, Salesbonus view.  
     Each of these four rows joins with each of the five rows in the 
     Transmaster table for employee #102, making 20 rows in the resulting 
     three-table view.  
 
         data from           data from                  data from
         Employee            Salesbonus                 Transmaster
       /          \        /           \              /            \
      empid emptitle   transdate netamount    transid custid empid transdate 
      ----- --------   --------- ----------   ------- ------ ----- ---------   
   
      102   Manager    01/12/89  $176000.00   4795    101    102   01/12/89
      102   Manager    02/27/89  $87500.00    4795    101    102   01/12/89
      102   Manager    02/28/89  $22500.00    4795    101    102   01/12/89
      102   Manager    03/01/89  $40500.00    4795    101    102   01/12/89
      102   Manager    01/12/89  $176000.00   4975    101    102   02/27/89
      102   Manager    02/27/89  $87500.00    4975    101    102   02/27/89
      102   Manager    02/28/89  $22500.00    4975    101    102   02/27/89
      102   Manager    03/01/89  $40500.00    4975    101    102   02/27/89
      102   Manager    01/12/89  $176000.00   4980    101    102   09/28/89
      102   Manager    02/27/89  $87500.00    4980    101    102   09/28/89
      102   Manager    02/28/89  $22500.00    4980    101    102   09/28/89
      102   Manager    03/01/89  $40500.00    4980    101    102   09/28/89
     
     This makes a lot of duplicated data in our final view.  Only a few 
     of the columns are unique to a row.  If we don't select all the 
     columns, we many even think we have duplicate rows.  By using 
     another column, transdate, to link the Transmaster table to the 
     Salesbonus table, the view changes to a one-many-one relationship 
     instead of a one-many-many.  
 
     If you suspect there are extra rows in your view due to a many-many 
     link, check it by selecting the data from the view and then from the 
     underlying tables.  With the above example, use the following commands 
     to count the rows for a particular empid.  The counts show that there 
     are indeed extra rows in our view.  
 
     R>SELECT COUNT(*) FROM employee WHERE empid=102                           
      
       COUNT (*)                                                               
       
       ----------                                                              
       
                1                                                              
 
     R>SELECT COUNT(*) FROM salesbonus WHERE empid=102                         
      
       COUNT (*)                                                               
       
       ----------                                                              
       
                4                                                              
 
     R>SELECT COUNT(*) FROM transmaster WHERE empid=102                        
      
       COUNT (*)                                                               
       
       ----------                                                              
       
                5
                                                                     
     R>SELECT COUNT(*) FROM bonus_view WHERE empid=102                         
           
       COUNT (*)                                                               
       
       ----------                                                              
       
               20
 
     It's much easier to detect this type of problem by picking a specific 
     data value to test.  How do you solve this?  Look at your table 
     structure, there may be another column that can used to link the 
     tables to create a one relationship rather than a many relationship.  
     Identify the columns that uniquely identify a row in each table.  In 
     the example above, the transdate column can be used to link the Sales-
     bonus table with the Transmaster table.  As with all SQL commands, 
     when creating views it's best to work with a small sample of data so 
     you can easily verify the results.