DOCUMENT #682 ======================================================================= WORKING WITH OUTER JOINS ======================================================================= Product: R:BASE Version : 3.1 & Higher ======================================================================= Area : DATA MANIPULATION Category: VIEWS ======================================================================= When linking tables in QBE (Query-by-Example), you usually choose to join them where the linking columns are equal, i.e. where you have data that matches in both tables. Sometimes, however, you also want to see the data where the linking column doesn't match. This option is not available from the menus; you can't view both the rows that match and the rows that don't match with a single query. You can, however, create a query like this from the R> prompt. It's called an "outer join." Even though you don't see it, when you create a query using the menus, R:BASE builds a SELECT command with a WHERE clause to link the tables and identify the rows to display. By definition, an outer join retrieves sets of completely different rows. A WHERE clause that displays all the rows that match between two or more tables is mutually exclusive from a WHERE clause that displays the rows that don't match. This means you need two different WHERE clauses. You might think you can see both sets of rows by creating a WHERE clause with two conditions separated by an OR operator, but this doesn't work. The right data is not retrieved. What you really need are two separate SELECT commands with two separate WHERE clauses. The menus won't let you do this. You need to use the UNION operator of the SELECT command at the R> prompt. The purpose of the UNION operator is to join two or more SELECT commands together to create one table of data. This article gives a few simple techniques to follow when creating outer joins and shows how to use the techniques with an actual example from the CONCOMP sample database. Identify The Tables And Data Conditions ======================================= The first step is to identify the tables to be joined and any conditions needed to qualify the data. Once the tables and conditions have been identified, sketch out the different select commands needed. Be sure to include the linking columns. Here's the basic structure of the complete SELECT command to create an outer join of two tables: SELECT data FROM table1 and table2 where the linking columns are equal, these are the matching rows UNION SELECT data FROM table1 where there is no matching data in table2 UNION SELECT data FROM table2 where there is no matching data in table1 You need three separate SELECT commands to cover all the possible combinations of selecting data from two tables. But you may not need all three of the SELECT commands. If you know that there is never a row in table2 if there isn't a row in table1, you won't need the third SELECT command. It's important to sketch out the different select commands you need before starting to write them because it becomes more complex with each table that you add to the join. When joining two tables, there can be up to three possible SELECT commands needed to retrieve all the data. When joining three tables, there can be up to seven possible SELECT commands. Look at the following table for a 3-table join. "YES" means there could be data in the table. | table 1 | table 2 | table 3 -------------------------------------------------------------------- SELECT1 | YES | YES | YES SELECT2 | YES | YES | NO SELECT3 | YES | NO | YES SELECT4 | YES | NO | NO SELECT5 | NO | YES | YES SELECT6 | NO | YES | NO SELECT7 | NO | NO | YES By knowing your database structure and how your data is organized you can reduce the number of SELECT commands needed to complete the outer join and won't need to account for all possible combinations of data. This is important because joining four tables for an outer join has over 15 possible SELECT commands. Write And Test Each SELECT Command By Itself ============================================ Once you have identified the tables and the conditions, and sketched out the different SELECT commands needed, begin writing the individual SELECT commands. Each SELECT command stands alone and is executed, tested and optimized by itself. Then all the working SELECT commands are combined into one using the UNION operator. When writing the individual SELECT commands, remember that each must select the same number of columns, and the columns must have the same datatypes and be selected in the same order. Start with the SELECT command that will retrieve the rows where the linking column matches across all the tables. The easiest way to write the SELECT command is to place each of its parts on a separate line like this: SELECT < the columns names you want to display data from > FROM < the table names the data is in > WHERE < put the linking column(s) here and any other data conditions > Use this same technique to structure the other SELECT commands. For the other SELECT commands, first fill out the FROM line. This identifies the table(s) this particular SELECT command will retrieve data from and keys both the SELECT and WHERE clauses. You need the same number of values (either column names or constants) after the keyword SELECT as in the first command. The FROM will determine which values are constants and which are column names. Make sure they are in the same order as the first command. The WHERE clause will use a sub-SELECT to identify the rows that are NOT IN the other table(s). A Sub-SELECT makes a list of data values to compare to. An example ========== Let's look at a specific example from the CONCOMP sample database. The desired result is a report of transaction information for each customer for the month of March. Not all customers had transactions in March, but all customers must be listed and show $0.00 if they had no transactions. First, identify the tables and data conditions that are needed and sketch out the SELECT commands. Transaction information is stored in the Transmaster table. Customer information is in the Customer table. The tables are linked by the custid column. The first SELECT command selects the rows from the Transmaster and Customer tables where the customer has a transaction record, i.e. the rows that match between the two tables. SELECT customer number (customer), company name (customer), transaction number (transmaster), transaction date(transmaster), transaction amount (transmaster), FROM customer,transmaster WHERE custid = custid and transdate in March Notice that the actual column names aren't used. You can, but when sketching out the SELECT commands it's often easier to not use the actual names, but just put the description of the data you want to see. Put the table name next to data description so it's easy to see which table those items are from. Next, sketch out the other SELECT commands to find the rows that don't match. Data entry rules are defined in the database, so there can't be a transaction (a row in the Transmaster table) if there is no customer record. This leaves just one additional SELECT command, to find the rows from the Customer table where the custid is not in the Transmaster table. SELECT customer number (customer), company name (customer), transaction number (constant), transaction date (constant), transaction amount (constant) FROM customer WHERE custid NOT IN (SELECT custid FROM transmaster WHERE transdate in March) Here data is only selected from the Customer table, so the data values in the first SELECT that were from the Transmaster table become constants in this SELECT command. Now write and test the actual SELECT commands. The first one, a straightforward multi-table SELECT command, looks like this: columns from the columns from the Transmaster table Customer table / \ / \ / \ SELECT custid,company,transid,transdate,netamount + FROM customer,transmaster + WHERE customer.custid=transmaster.custid AND + / transmaster.transdate BETWEEN 3/1/89 AND 3/31/89 / \ links the tables \ qualifies the rows Notice that in the WHERE clause, in addition to linking the tables to find the rows that match (customer.custid=transmaster.custid), there is the additional condition to retrieve only the records from March. This second condition is on a column (transdate) from the Transmaster table only. After this SELECT command works to retrieve the desired data, write and test the next command. This is just a single table select. Because there is no data from the Transmaster table no columns are selected from that table. "Place holders" are used instead, either a constant value or '' (quote quote) for null. There are no tables to link together in the WHERE clause. Instead, use a Sub-SELECT to select the rows from the Customer table that are not in the Transaction table. place holders to match columns from the columns from the Transmaster table Customer table / \ / \ / \ SELECT custid,company,0,'',$0 + FROM customer + WHERE custid NOT IN (SELECT custid FROM transmaster WHERE + / transdate BETWEEN 3/1/89 AND 3/31/89 ) / \ finds the rows that \ don't match only compare to rows in the right date range Remember that in the first select there was a WHERE clause based on the Transmaster table. The second select is from the Customer table only, so there is no Transdate column to put a condition on. To make sure the right rows are retrieved, that WHERE clause condition must be put on the Sub-SELECT, which is from the Transmaster table. The data includes only customers who didn't have transactions in March. If there is no WHERE clause on the sub-SELECT to restrict the date to March, this SELECT command would have returned customers who, in some month, sometime, had no transactions; not the correct data. Testing each SELECT command separately makes sure it is returning correct data. If each of the individual SELECTs return correct data, then the complete SELECT will return correct data. Finally, join the two SELECT commands with the UNION operator: SELECT custid,company,transid,transdate,netamount + FROM customer,transmaster + WHERE customer.custid=transmaster.custid AND + transmaster.transdate BETWEEN 3/1/89 AND 3/31/89 + UNION + SELECT custid,company,0,'',$0 + FROM customer + WHERE custid NOT IN (SELECT custid FROM transmaster WHERE + transdate BETWEEN 3/1/89 AND 3/31/89 ) This is the resulting data: custid company UNNAMED UNNAMED UNNAMED -------------------------------------------------------------- 100 PC Distribution Inc. 5080 03/20/89 $80000.00 101 Computer Distrib Inc. 5000 03/01/89 $29000.00 101 Computer Distrib Inc. 5060 03/08/89 $30000.00 102 Industrial Computers 5075 03/16/89 $155500.00 103 Computer Mountain Inc. 5015 03/06/89 $42000.00 104 Industrial Concepts 5050 03/07/89 $56250.00 104 Industrial Concepts 5070 03/15/89 $95500.00 106 Computer Warehouse 5065 03/14/89 $140300.00 107 Midtown Computer Co. 5010 03/03/89 $108750.00 107 Midtown Computer Co. 5085 03/19/8 $74250.00 110 Southwest Computers 0 -0- $0.00 105 PC Consult and Design 0 -0- $0.00 Each successive select that is joined on with the UNION operator simply adds rows to the temporary table whose structure is defined by the first SELECT command. Notice that three of the columns are titled "UNNAMED". These are the columns where there is data only from the first SELECT command, the second SELECT command isn't selecting data from these columns. Continue to combine SELECT commands with the UNION operator making sure each one works by itself. Test and optimize each SELECT command separately, then combine them. For information on optimizing the SELECT commands see the articles "Optimizing Application code for Speed" in the May/June 1992 Exchange (automated FAX server document #641, 206-649-2789) and "Making SUB-SELECTS, Outer Joins, & Rules Faster" in the July/August 1991 Exchange (FAX document #318). Create the view =============== Once the complete SELECT command is working, simply preface it with CREATE VIEW viewname AS and a view is created. A view is simply a saved SELECT command. But the view has "UNNAMED" columns, and "UNNAMED" columns can't be referenced in a report. To access all the columns in the view, give the columns names when the view is created. Using the above example, CREATE VIEW test (custid,company,transid,transdate,netamount) AS + SELECT custid,company,transid,transdate,netamount + FROM customer,transmaster + WHERE customer.custid=transmaster.custid AND transdate + BETWEEN 3/1/89 AND 3/31/89 + UNION + SELECT custid,company,0,'',$0 + FROM customer + WHERE custid NOT IN (SELECT custid FROM transmaster WHERE transdate + BETWEEN 3/1/89 AND 3/31/89 ) The column names can be the same names as the column names in the tables or they can be unique to the view. Just be sure to name all the columns being selected. Once you have mastered the technique, outer joins are no longer a mystery. Use the techniques described above to successfully join any number of tables together. Examples Using More Than Two Tables =================================== Taking the example above, how would it change to include data from the Employee table as well? Add the columns from the Employee table to the first SELECT command and add an additional link to the WHERE clause. The second SELECT command doesn't change except for adding place holders for the employee information. There is no link between the Customer and Employee tables. SELECT custid,company,transid,transdate,netamount,EMPLNAME,EMPCITY FROM customer,transmaster,EMPLOYEE WHERE customer.custid=transmaster.custid AND TRANSMASTER.EMPID=EMPLOYEE.EMPID AND transdate BETWEEN 3/1/89 AND 3/31/89 UNION SELECT custid,company,0,'',$0,'','' FROM customer WHERE custid NOT IN (SELECT custid FROM transmaster WHERE transdate BETWEEN 3/1/89 AND 3/31/89 ) This is the resulting data. Note that the additional columns are UNNAMED because they are constants in the second SELECT command. UNNAMED UNNAMED UNNAMED UNNAMED UNNAMED UNNAMED UNNAMED ------------------------------------------------------------------------------ 100 PC Distribution Inc. 5080 03/20/89 $80000.00 Coffin Duvall 101 Computer Distrib Inc. 5000 03/01/89 $29000.00 Wilson Seattle 101 Computer Distrib Inc. 5060 03/08/89 $30000.00 Wilson Seattle 107 Midtown Computer Co. 5085 03/19/89 $74250.00 Smith Seattle ....... 110 Southwest Computers 0 -0- $0.00 -0- -0- 105 PC Consult and Design 0 -0- $0.00 -0- -0- What if information from the Contact table was added instead? Again, the contact information is added and linked in using the first SELECT command. There is a link between the Customer and Contact tables so the second SELECT includes the columns from the Contact table also. SELECT custid,company,transid,transdate,netamount,CONTLNAME FROM customer,transmaster,CONTACT WHERE customer.custid=transmaster.custid AND CUSTOMER.CUSTID=CONTACT.CUSTID AND transdate BETWEEN 3/1/89 AND 3/31/89 UNION SELECT custid,company,0,'',$0,CONTLNAME FROM customer,CONTACT WHERE CUSTOMER.CUSTID=CONTACT.CUSTID AND custid NOT IN (SELECT custid FROM transmaster WHERE transdate BETWEEN 3/1/89 AND 3/31/89 ) This is the resulting data. Note the additional column for contlname. It is not UNNAMED because it is selected in both SELECT commands. custid company UNNAMED UNNAMED UNNAMED CONTLNAME ------------------------------------------------------------------------ 100 PC Distribution Inc. 5080 03/20/89 $80000.00 Estwitz 100 PC Distribution Inc. 5080 03/20/89 $80000.00 Sabini 101 Computer Distrib Inc. 5000 03/01/89 $29000.00 Finnegan 101 Computer Distrib Inc. 5000 03/01/89 $29000.00 Preston 107 Midtown Computer Co. 5085 03/19/89 $74250.00 Brady ....... 110 Southwest Computers 0 -0- $0.00 Adams 105 PC Consult and Design 0 -0- $0.00 Chin What would the SELECT command look like if we added the Employee information and also wanted to see those employees who made no transactions in March? This requires the addition of a third SELECT command. The first SELECT command finds and displays information about customers and employees with transactions in March; the second SELECT displays the information about customers who didn't have transactions in March; the third SELECT displays the information about employees who didn't have transactions in March. SELECT custid,company,transid,transdate,netamount,EMPLNAME,EMPCITY FROM customer,transmaster,EMPLOYEE WHERE customer.custid=transmaster.custid AND TRANSMASTER.EMPID=EMPLOYEE.EMPID AND transdate BETWEEN 3/1/89 AND 3/31/89 UNION SELECT custid,company,0,'',$0,'','' FROM customer WHERE custid NOT IN (SELECT custid FROM transmaster WHERE transdate BETWEEN 3/1/89 AND 3/31/89 ) UNION SELECT 0,'',0,'',$0,EMPLNAME,EMPCITY FROM EMPLOYEE WHERE EMPID NOT IN (SELECT EMPID FROM transmaster WHERE transdate BETWEEN 3/1/89 AND 3/31/89 ) This is the resulting data. Note that all columns are now UNNAMED. UNNAMED UNNAMED UNNAMED UNNAMED UNNAMED UNNAMED UNNAMED ------------------------------------------------------------------------------- 100 PC Distribution Inc. 5080 03/20/89 $80000.00 Coffin Duvall 101 Computer Distrib Inc. 5000 03/01/89 $29000.00 Wilson Seattle 101 Computer Distrib Inc. 5060 03/08/89 $30000.00 Wilson Seattle 102 Industrial Computers 5075 03/16/89 $155500.00 Hernandez Seattle 103 Computer Mountain 5015 03/06/89 $42000.00 Smith Seattle 104 Industrial Concepts 5050 03/07/89 $56250.00 Hernandez Seattle 104 Industrial Concepts 5070 03/15/89 $95500.00 Hernandez Seattle 106 Computer Warehouse 5065 03/14/89 $140300.00 Simpson Redmond 107 Midtown Computer Co. 5010 03/03/89 $108750.00 Smith Seattle 107 Midtown Computer Co. 5085 03/19/89 $74250.00 Smith Seattle 110 Southwest Computers 0 -0- $0.00 -0- -0- 105 PC Consult and Design 0 -0- $0.00 -0- -0- 0 -0- 0 -0- $0.00 Chou Woodinvi 0 -0- 0 -0- $0.00 Watson Redmond 0 -0- 0 -0- $0.00 Williams Seattle