Doc# 746 ===================================================================== Calculating the Mean, Median and Mode ===================================================================== Products: R:BASE 3.x, R:BASE 4.0, R:BASE 4.0a, R:BASE 4.5 ===================================================================== Area: Application Solutions CATALOG: Programming in R:BASE ===================================================================== The mean, the median and the mode are three common statistical measures for looking at how data is grouped around the center or midpoint of data. Many complex formulas and variations can be used; here we show how to calculate the mean, the median and the mode using simple raw data methods. The mean is the average -- the sum of all the values divided by the number of values. The mean is affected by values that are very different. A single very low value or very high value can skew the mean to the left or right. The median is the exact midpoint of the ordered set of values -- the number where half the values are below it and half are above it. With an odd number of values the median is the actual midpoint value. With an even number of rows, the median is the average of the two middle numbers. Very low or very high numbers do not affect the median as they do the mean. The mode is the single value with the most occurrences. If each value occurs only once, there is no mode. If two values occur with the same frequency, they are both considered the mode and the data is said to be bimodal. The data can also be multimodal where more than two values occur with the same frequency. For example, look at the mean, median and mode for this list of 10 test scores. 65 75 ---- The mode - the value with the The mean or 75 ---- most occurrences average is 84____ 77 87 ____ The median is 87.5, the 88 average of 87 and 88; 92 five values are above it, 93 five values below it 94 98 These measures when used together show if the data distribution is symmetrical or skewed. When the mean, median and mode are all the same or very close together, there is a symmetrical distribution. When the median is less than the mean, the data is skewed to the left. When the median is greater than the mean, the data is skewed to the right. The size of the difference indicates the degree the data is skewed. In the above example, the low score of 65 skews the mean slightly to the left. The mean or average is easy to calculate in R:BASE -- just use the AVG function with the SELECT command. The mode is also calculated using SELECT, but it requires two commands -- one to find the maximum number of occurrences in the data for any value, the second to find the actual values. The median is a more difficult number to find; the easiest method is to use an ordered view and the special R:BASE keyword COUNT. The command file below calculates the mean, median and mode for a selected column. The column cannot be a TEXT, NOTE, DATE or TIME datatype. The column and table name are held in variables. The calculations use these as ampersand variables. See the article "Working with Variables" in the September/October 1993 Exchange for more information on using ampersand variables. CLS SET MESSAGES OFF SET ERROR MESSAGES OFF CLEAR ALL VAR CHOOSE vtab FROM #TABLES AT CENTER,CENTER -- if no table selected, end the program SET VAR vkey = (LASTKEY(0)) IF vkey = '[Esc]' THEN RETURN ENDIF CHOOSE vcol FROM #COLUMNS IN &vtab CASCADE -- if no column selected, end the program SET VAR vkey = (LASTKEY(0)) IF vkey = '[Esc]' THEN RETURN ENDIF WRITE 'Processing...' AT 24,3 -- check to see if the selected column is a valid datataype SELECT sys_data_type INTO vdatatype FROM sys_columns WHERE sys_column_name = .vcol IF vdatatype IN (3,8,4,5) THEN CLS FROM 24,1 TO 24, 50 PAUSE FOR 30 USING 'You selected a TEXT,NOTE,DATE or TIME + column. You can''t calculate statistics on this column.' + AT CENTER,CENTER DEFAULT CLS RETURN ENDIF -- compute the mean (average) -- the AVG function automatically excludes NULL values SELECT AVG(&vcol) INTO vaverage vind1 FROM &vtab SET VAR vmsg1 = ('The mean is '+ CTXT(.vaverage) ) -- compute the median -- NULL values are excluded by using a WHERE clause -- on the CREATE VIEW command DROP VIEW median CREATE VIEW median as SELECT &vcol FROM &vtab + WHERE &vcol IS NOT NULL ORDER BY &vcol COMPUTE vrows AS ROWS FROM &vtab WHERE &vcol IS NOT NULL -- find out if the number of rows is odd or even SET VAR vmod = (MOD(.vrows,2)) -- the R:BASE WHERE clause keyword COUNT is used to find the -- row that is that number from the beginning of the table. -- The first row of a table has a COUNT of 1. By ordering the -- rows in a view, we can use the COUNT to easily find the -- midpoint. A WHERE clause is evaluated before an ORDER BY so -- we can't use this procedure directly on the table IF vmod <> 0 THEN -- an odd number of rows SET VAR vmid = (INT(.vrows/2)+1) SELECT &vcol INTO vmedian FROM median WHERE COUNT = .vmid ELSE -- an even number of rows SET VAR vmid1 = (.vrows/2) SET VAR vmid2 = ((.vrows/2)+1) SELECT &vcol INTO v_value1 FROM median WHERE COUNT = .vmid1 SELECT &vcol INTO v_value2 FROM median WHERE COUNT = .vmid2 SET VAR vmedian = ((.v_value1+.v_value2)/2) ENDIF SET VAR vmsg2 = ('The median is '+ CTXT(.vmedian) ) -- compute the mode -- NULL values are excluded by using a WHERE clause -- on the CREATE VIEW command -- find the maximum number of occurrences SELECT COUNT(*) INTO vmax vind2 FROM &vtab + WHERE &vcol IS NOT NULL GROUP BY &vcol ORDER BY 1 DESC IF vmax = 1 THEN CLS WRITE 'For the column',.vcol,'in the table',.vtab AT 7 17 WRITE .vmsg1 AT 9 20 WRITE .vmsg2 AT 11 20 WRITE 'There is no mode' AT 13 20 ELSE CLS WRITE 'For the column',.vcol,'in the table',.vtab AT 7 17 WRITE .vmsg1 AT 9 20 WRITE .vmsg2 AT 11 20 SET HEADINGS OFF WRITE 'the mode values ', 'the number of occurrences' + AT 13 20 -- position the cursor SET VAR vblank TEXT = ' ' SHOW VAR vblank AT 14,1 -- select all the values whose number of occurrences matches -- the maximum SELECT &vcol=30, COUNT(*)=15 FROM &vtab + GROUP BY &vcol HAVING COUNT(*) = .vmax SET HEADINGS ON ENDIF RETURN ------------ Please Note: The preceding article contains commands that are specific to R:BASE 4.5. The R:BASE 4.5 commands are included to demonstrate new features and capabilities. These commands must be excluded or modified if you choose to use these programming techniques with prior versions of R:BASE.