## Linear single variable regression analysis

Consider a firm of architects that has data on 30 types of houses from a project. The database has data on the cost price and on both the gross and usable floor-space per house. Table 4.1 shows the content of the database. For instance, type 3A has a cost price of € 98,352 and has a gross floorspace of 163 m2 and a usable floorspace of 122.2 m2.

The firm wishes to develop a formula to predict the cost price of a (similar) house based on either the gross or usable floorspace.

Figure 4.1 shows a graph with the cost price on the X-axis, the gross floor-space on the Y-axis and all data points that can be derived from Table 4.1. It can be seen that there is a relation between the cost price and the gross floor-space. A line can be drawn by hand through all of these points as shown in Figure 4.2.

This line through the cloud of points, is represented by the equation:

Where:

 x independent variable y = dependent variable e error term = intercept of line y = + x ft = slope of line y = + x

Table 4.1 Cost price vs. floorspace

 Type Cost price Gross Usable floorspace floorspace [ C] [m2] [m2] 3A 98,352 163 122.20 3AK 117,887 163 122.20 3B 121,408 173 122.20 3C 106,772 163 128.10 8A 95,422 142 111.40 8B 108,140 168 132.70 8C 107,689 161 125.30 5B 104,175 178 133.70 5C 121,962 188 133.70 5E 112,513 178 139.20 6A 100,905 158 122.70 6B 114,288 194 151.90 6C 122,941 167 122.90 6D 131,675 172 126.40 7A 95,360 147 121.10 7AK 112,445 147 121.10 7B 117,433 195 148.20 7BK 134,518 195 148.20 7E 114,288 194 151.90 7F 113,625 147 121.10 1A 139,567 169 153.00 1AK 177,639 169 153.00 1AK2 184,950 204 184.60 4A 136,986 210 171.70 4B 191,438 215 192.80 4C 192,753 235 203.00 4D 167,405 215 182.40 5A 118,838 194 161.70 5AK 140,294 194 161.70 5D 147,366 203 166.80

195000 -185000 -175000 -165000 -155000 145000 -135000 -125000 -115000 -105000 95000

130 140 150 160 170 180 190 200 210 220 230 240 Gross floorspace

Figure 4.1 Graph showing all data points

195000 185000 175000 165000 155000

■H 145000 135000

° 125000 115000 105000 95000 85000 75000

130 140 150 160 170 180 190 200 210 220 230 240 Gross floorspace Figure 4.2 Graph showing a line that fits the data points Figure 4.3 The three deviations associated with a data point (Source: Aczel (2002))

This is a theoretical relationship which we presume to exist. The error term is by chance, so every point in the cloud of points contains some error. However, we can make an estimate of the theoretical relationship by means of statistical estimation techniques. We construct a straight line that fits the cloud of points best (the vertical squared distances from the different points to the line are minimal - the method of least squares). In this way we can find the best estimates for the unknown beta's with the associated statistical properties, such as standard errors.

In practice, this equation can be calculated from the data set by a computer or a hand calculator with a built in routine for linear regression.

Once we have established that a linear relationship exists between the two variables, the question arises: How strong is the relationship? In other words: To what extent are deviations from the mean explained by the regression?

For any data point (x, y) we can distinguish three deviations from the mean y. See Figure 4.3:

In words: The total deviation equals the unexplained deviation (error) plus the explained deviation (regression).

Squaring and summing over all n points (cross terms drop out) yields: Figure 4.4 Coefficients of determination in different regressions (Source: Aczel (2002))

Total sum of squares = Sum of squares for error + Sum of squares for regression.

In words: The total variation (TSS) equals the unexplained variation (SSE) plus the explained variation (SSR).

The ratio explained variation divided by total variation is called the coefficient of determination:

r = correlation coefficient; —1 < r < 1, so 0 < r2 < 1.

Between the two extremes, r2 _ 0 (no fit at all) and r2 _ 1 (perfect fit), values of r2 give an indication of the relative fit of the regression model to the data. The higher r2, the better the fit and the higher our confidence in the regression (Fig. 4.4).

To carry out the regression analysis we first add the data to an empty Excelsheet as shown in Figure 4.5.

Because there are two candidate independent variables (gross and usable floorspace) we first carry out a correlation analysis to find out which of these ali '.¿.UUJLUWWW^^^M Jjjjx]

Fie Edit View Insert Format Tools Data Window WB! Help Type a question for help - _ iS X

i Ú sä a a a i <a a i ^ & i * ^ a - & i • -1 a E - si ii i m &ioo% - © 1

i Anal -10 - B i u m m m MI \$ % ♦ tdS ¿s ww EE - & - A -1

Fie Edit View Insert Format Tools Data Window WB! Help Type a question for help - _ iS X

i Ú sä a a a i <a a i ^ & i * ^ a - & i • -1 a E - si ii i m &ioo% - © 1

i Anal -10 - B i u m m m MI \$ % ♦ tdS ¿s ww EE - & - A -1

A1 _fx Type

A

B

C

D

E

F

G

H

I

J

K

L T"

1

Cost price

Gross fe.

Usable fs.

2

Type 3A

98352

163

122.2

3

Type 3A

117887

163

122.2

4

Type 3B

121408

173

122.2

5

Type 3C

106772

163

128.1

6

Type 8A

95422

142

111.4

7

Type 8B

108140

168

132.7

8

Type 8C

107689

161

125.3

9

Type 8D

111181

137

93.5

10

Type 5B

104175

178

133.7

11

Type 5C

121962

188

133.7

12

Type 5E

112513

178

139.2

13

Type 6A

100905

158

122.7

14

Type 6B

114288

194

151.9

15

Type 6C

122941

167

122.9

16

Type 6D

131675

172

126.4

17

Type 7A

95360

147

121.1

18

Type 7A

112445

147

121.1

ill

19

Type 7B

117433

195

148.2

20

Type 7B

134518

195

148.2

ü-

21

Type 7E

114288

194

151.9

22

Type 7F

113625

147

121.1

23

Type 1A

139567

169

153

?4

Tvpe 1A

177639

169

153

► ir

H < ► h Xsheetl/Sheet2 /Sheet3

<l

Figure 4.5 Data from database in spreadsheet fits the data best. In cell C33 we type:

= correl(B2:B32,C2:C32) In cell D33 we type:

This shows that the usable fioorspace has a larger correlation coefficient (r = 0.83) than the gross fioorspace (r = 0.71). This suggests we should carry out the regression analysis using the usable fioorspace rather than the gross fioorspace as the independent variable. The larger correlation coefficient means that the points fit the line through the cloud of points more closely. Figure 4.6 shows a graph with the cost price on the X-axis, the usable fioorspace on the Y-axis and all data points that can be derived from Table 4.1.

Comparison of Figure 4.6 with Figure 4.2 shows that the points related to the usable fioorspace indeed fit the line more closely than the points related to the gross fioorspace.

To carry out a regression analysis the data analysis toolpak must be installed. To check whether this is the case, select the menu-item 'Tools/Add-Ins' and make sure that the list item 'Analysis Toolpak' is checked as shown in Figure 4.7.

If the toolpak is installed select the menu-item 'Tools/Data analysis' and then select 'Regression' from the list as shown in Figure 4.8. Usable floorspace

Figure 4.6 Graph showing a line that fits the data points

Figure 4.6 Graph showing a line that fits the data points

r Conditional Sum Wizard Browse f~ Euro Currency Tools —-—1—111 r Internet Assistant VESA f" Lookup Wizard r MS Query Add-in p ODBC Add-in ■mMMHHB -Analysis ToolPak—

Provides functions and interfaces for financial and scientific data analysis

Figure 4.7 Add-Ins dialog Figure 4.8

Regression option

Fte Edit View Insert Format lools Data Window WB! Help

; Arial T 10 T B I U m m m ^ \ % % * Toe i°e ii JP

A1 T fx SUMMARY OUTPUT

Fte Edit View Insert Format lools Data Window WB! Help

; Arial T 10 T B I U m m m ^ \ % % * Toe i°e ii JP

A1 T fx SUMMARY OUTPUT

 A I B C D E F G H I J K L - 1 SUMMAR OUTPUT 2 3 Regress/on Statistics 4 Multiple R 0.833656 5 R Square 0.694983 G Adjusted 0.684465 7 Standard 15751.02 8 Obseivatio 31 9 10 ANOVA F 11 df SS MS F :qnificance 1? Reg re s s io 1 1.64E+10 1.64E+1Û 66.07651 5.79E-09 13 Residual 29 7.19E+09 2.48E+08 14 Total 30 2.36E+10 15 16 Coefficientifandard Err t Stat P-value Lower 95% Upper 95%ower 95.09'pper 95.0°/ 18 Intercept X Variable -1735.1 906.0137 16178.4 111.458 -0.10725 8.128746 0.915331 -34823.6 31353.45 -34823.6 31353.45 5.79E-09 678.0565 1133.971 678.0565 1133.971 19 20 21 22 23 H < ► H |\ She et4/ Sheet / Sheet2 5heet3 /

Figure 4.9 Single regression analysis output in spreadsheet

This opens up a new window where you have to select the 'Input Y range' and the 'Input X range'. The 'Input Y range' is the range of cells containing the cost prices, in this case cells B2 through B32. The 'Input X range' is the range of cells containing the usable floorspace, in this case cells D2 through D32.

After adding these ranges, press the 'OK' button to carry out the analysis. This output shows the results on a new worksheet as shown in Figure 4.9

The equation to predict the deviation can be derived from the 'Intercept' (cell B17) and the 'X Variable' (cell B18). The equation is:

where:

y = cost price x = usable floorspace

With this formula any estimate of the cost price can be made. Figure 4.10 shows a graph with both the actual and estimated cost price per type of house.

The question arises: How 'good' is the resulting equation? Let us first look at plausibility. According to the equation an increase of one square metre floor space results in an increase of the cost price of € 906,- (units) which seems reasonable. The output shows that R2 = 0.69 (cell B5) so a fair amount of the variance in cost price is explained by the floorspace. We would expect

Actual vs. estimated costprice

200000 -

175000 -150000 -125000 -100000 -75000 -

O actual O estimated

1 2 3 4 5 6 7 8 9 1 0 11 12 1 3 1 4 1 5 1 6 1 7 1 8 1 9 20 21 22 23 24 25 26 27 28 29 30 31

Figure 4.10 Actual cost prices (left columns) compared to estimated cost prices (right columns)

however, that almost all variance in cost price could be explained. So a higher R2 could be obtainable. To this end we should incorporate more variables in the analysis. Statistical significance is in order as far as the X-variable is concerned, since the p-value is small and the lt-statisticl, being in the order of 8, exceeds the critical threshold.

Examining the data more closely, one can see that type 3A and 3AK houses have identical values where the usable floorspace is concerned but the cost prices differ. It can be noticed that this is true for all other types with a 'K' in them. This is a clear indication that the usable floorspace is probably just one of more independent variables that explain the cost price. The K-types turn out to be the types that are situated at the end of the blocks of houses. For that reason their surface area is larger, hence their facade is larger. This accounts for the fact that they have the same floorspace but higher costs as the facade is more costly than dividing walls. To allow for this, we have two options:

• Remove these types of houses from the input data set;

• Perform a multiple variable regression analysis (also taking the surface area of the facade into account).

The first option will result in an equation that will better fit the data points. Carry this analysis out for yourself and see how the value of R2 will increase. It comes at a price, however, because your prediction does not apply to houses situated at the end of a block.

A multiple variable regression analysis, also based on this example, is carried out in the next section. ## Best Investment Tips and Ideas

The dynamics of investing can be very emotional and stressful if not properly managed. When you are aware of what is all involved you give yourself the power to avoid those situations or at least manage them effectively. That will make your investments more exciting, rewarding, and enjoyable. Those positive factors will only lead to greater success in all that achieve with investments and life.

Get My Free Ebook