Authors: Nohman Akthtari, Alec Jacobs, Trevor Kapuvari | 12.03.2023

Course: Statistical And Data Mining Methods For Urban Spatial Analytics


INTRODUCTION


In our previous study, we employed Ordinary Least Squares (OLS) regression to examine the relationship between median home sales value by census block for homes in Philadelphia and four predictors: PCTBACHMOR (proportion of residents in a block group with at least a bachelor’s degree), PCTVACANT (proportion of housing units that are vacant), PCTSINGLE (percent of housing units that are detached single family houses), and LNNBELPOV (natural log of the number of households living in poverty). While OLS regression is a common and powerful tool, it assumes that the data is randomly distributed in space. However, in many real-world scenarios, data exhibits spatial autocorrelation, meaning that nearby observations are more likely to be similar than observations that are farther apart. This can lead to biased and inefficient OLS estimates.

To address this issue, we employ three alternative modeling approaches in this study: spatial lag, spatial error, and geographically weighted regression (GWR). Spatial lag models account for spatial autocorrelation by incorporating the values of the dependent variable from neighboring observations into the regression equation. Spatial error models, on the other hand, assume that the residuals, rather than the dependent variable, are spatially autocorrelated. GWR, unlike OLS and spatial lag and error models, allows the regression coefficients to vary across space, reflecting the potential heterogeneity in the relationships between the predictors and the dependent variable.

Our primary objective is to assess whether these alternative models can offer improved capabilities in accounting for spatial autocorrelation compared to OLS. We will compare the performance of the models based on their goodness-of-fit statistics, spatial autocorrelation diagnostics, and the interpretability of their coefficients. Additionally, we will examine how the spatial models compare to each other in terms of their ability to capture the spatial variation in home sales values.


METHODOLOGY


SPATIAL AUTOCORRELATION AND MORAN’S I


The guiding principal of spatial statistics can be summarized well by the following quote:

\[ \textit{"Everything is related to everything else, but near things are more related than distant things.”} \] This quote was provided by Swiss geographer Waldo Tobler and referred to as the First Law of Geography. The statistical translation behind Tobler’s law is spatial autocorrelation, which can be derived from temporal autocorrelation. Temporal correlation means that values of a variable tend to be similar for time periods close together. We can translate the same idea into space. Given an aerial division of a two-dimensional space in polygons, we measure whether neighboring (or “close”) locations have similar values with regard to our variable of interest or not. In case we observe such a trend, we speak of positive spatial autocorrelation. If not, we refer to the absence of spatial autocorrelation, or a negative spatial autocorrelation, indicating that neighboring (or “close”) values exhibit repulsion by being significantly different.

There are several measures of spatial autocorrelation but most widely accepted is Moran’s I. We first present the formula and its components and then present an explanation and interpretation.

Let there be an aerial polygon division of the observed two-dimensional space, and let \(X\) be the variable of interest. Moran’s I is then calculated as:

\[ I = \frac{1}{\sum_{i = 1}^{n} \sum_{j = 1}^{n}w_{ij}} \frac{\sum_{i=1}^{n}\sum_{j=1}^{n}w_{ij}(X_i - \overline{X})(X_j - \overline{X})}{\sum_{i = }^{n}(X_i - \overline{X})^2 / n}, \] where \(X_i, X_j\) are the observed values of the variable in polygons \(i,j\) respectively, \(\overline{X}\) is the mean of the variable, \(w_{ij}\) is the weight indexing of location \(i\) relative to location \(j\), and \(n\) is the number of observed aerial units.

\(w_{ij}\) is the \(ij-\)th entry in the contiguity weight matrix \(W \in M_{n \times n}(\mathbb{R}),\) where \(M_{n \times n}\mathbb{R}\) is the vector space of \(n \times n\) matrices with entries in \(\mathbb{R}.\) The contiguity weight matrix codes spatial proximity into spatial weights. Each contains all combinations of a given aerial unit with the \(n\) aerial units in question and the associated weights. It is best practice to row-standardize \(W\) because this will cancel out the first fraction in the equation above.

Whether locations \(i,j\) are neighbors depends on whether \(w_{ij}\) is non-zero, which in turn depends on the choice of type of \(W.\) whereas there are distance-based approaches for \(W,\) the standard literature usually prefers the so-called Queen contiguity matrix. The name derives from the allowed moves of the queen on a chessboard and visually translates into the notion that each location \(j\) that shares a border with location \(i\) is defined as a neighbor, implying \(w_{ij} \not= 0.\) Hence, non-neighboring locations are effectively cut out of the presented formula. For the rest of our study, we will employ the Queen contiguity method.

It becomes evident that statistical work employing Moran’s I could be sensitive on the choice of \(W\) to a degree of invalidating results. Therefore, many statisticians show the robustness of their results by demonstrating the results hold for other types of \(W,\) too. With the above explanations and the definition of the Pearson correlation coefficient in our previous assignment, we see that Moran’s I is a standardized number between \(-1\) and \(1\) measuring how much neighboring values of \(X\) co-vary relative to the overall variation in \(X\). \(X\) is the term in the denominator of the second fraction in the presented formula. A value close to 1 indicates strong positive spatial correlation, denoting clustering of values in nearby locations. A value close to -1 indicates negative spatial correlation. Thus, implying repulsion of values in nearby locations.


MORAN’S I SIGNIFICANCE TEST


A way of significance testing for Moran’s I is by randomly permuting the observed values of our variable \(X\) and thereby assigning each location \(i\) a potentially new value.

We first introduce the algorithm to obtain values for our Moran’s I significance test. Then, we introduce pseudo p-values and the significance test, and end by giving an interpretation on the employed method.

We denote our original Moran’s I value as \(\text{Moran}_{\text{initial}}.\) and follow with a random permutation - calculating Moran’s I for our first permutation. We then repeat this process another 998 times and rank the 999 test results in addition to \(\text{Moran}_{\text{initial}}\) in descending order. Now, let \(\text{rank}(x_i), i \in [0, 1000]\) return the relative position of the \(x_i-\)th permutation in the ordered set. Since there exists an \(i'\) such that \(x_i = \text{Moran}_{\text{initial}},\) we can then calculate the so-called pseudo p-value by computing

\[ p = \frac{\text{rank}(\text{Moran}_{\text{initial}})}{1000}. \]

For our hypothesis test, we have the following statements:

\[ H_0: \text{There is no spatial autocorrelation.} \]

\[ H_{a_1}: \text{There is positive spatial autocorrelation.} \]

\[ H_{a_2}: \text{There is negative spatial autocorrelation.} \]

In case the pseudo p-value is less than 0.05, we can reject the null hypothesis of no spatial autocorrelation for either \(H_{a_1}\) or \(H_{a_2}.\) In case \(\text{Moran}_{\text{initial}}\) is positive, we can reject \(H_{a_2}\) and will fail to reject \(H_{a_1},\) and vice versa.

In simple terms, given a set of values of \(X\) for each aerial unit for a polygon division of a two-dimensional space, random permutation recalculates Moran’s I with the same contiguity weight Matrix \(W\) for each permutation. This produces meaningful values for our significance test to work with since it shows the relative extremeness of \(\text{Moran}_{\text{initial}}.\) Our hypothesis test with the three presented outcomes is then evaluated via the pseudo p-value with the standard threshold of 0.05, where the pseudo p-value is the probability to observe a value that is equal to or more extreme than \(\text{Moran}_\text{initial}.\)

If the pseudo p-value is greater than that of our Moran’s I, we fail to reject \(H_0\). This implies a condition of no spatial autocorrelation. Conversely, if the pseudo p-value associated with our Moran’s I test is lower than the predetermined significance level (typically set at 0.05), we reject the null hypothesis \(H_0\) in favor of the alternative. This rejection indicates that we have statistically significant clustering in space.


LIMITATIONS OF MORAN’S I


For Moran’s I, in all of the above constructions, we assumed a given division of two-dimensional space. However, the value of Moran’s I is dependent on the choice of division. This means that Moran’s I is not canonical. In geographic terms, there is a modifiable aerial unit problem inherent in the measure. One popular application of this is the phenomenon of gerrymandering. As mentioned, the choice of \(W\) can also change the results of spatial analyses. Edge aerial units by definition don’t have a “full” neighborhood, a constraint that can significantly affect the results especially if the weights to the right of border aerial units are high in \(W.\) On the other hand, aerial units could have largely differing amounts of neighbors, again qualifying the generalizability of the measure.

Further, patterns such as the total division between high and low values between two halves of a map result in nearly no spatial autocorrelation, even though it practically is high. Only random shuffling significance tests would show us that something is off with the Moran’s I for these cases. Additionally, some choice of a random variable \(X\) suffer under the constraint of non-uniformity in space. For example, bank robberies can only occur in banks. Ultimately, Moran’s I is a global measure that does not reveal the source of a high or low value given we don’t know which areas of our space contribute to the degree of our Moran’s I. For this reason, it often makes sense to visualize local indices of spatial autocorrelation (LISA).


SPATIAL LAG


LISA

Conceptually, LISA is similar to Moran’s I but instead of measuring the global spatial autocorrelation, it measures the extent values of \(X\) at sites \(i\) are related to values of \(X\) at neighboring sites. The formula for LISA is given by

\[ I_i =\frac{1}{\sum_{j = 1}^{n}w_{ij}} \frac{(X_i - \overline{X}) \sum_{j=1}^{n}w_{ij}(X_j - \overline{X})}{(X_i - \overline{X})^2 / n}, \] and we see that the LISA statistic is proportional to Moran’s I in that the Moran’s I is the mean of the set \(I_i\) where \(i\) is defined such that all locations of our spatial division are included.


LISA SIGNIFICANCE TEST


The procedure for the LISA significance test is equal to the one for the Moran’s I metric, but is conducted for a location \(i\) instead of the whole data set. It is noteworthy that when we reshuffle for a particular location \(i,\) we do so for each of the \(j\) neighbors but not for \(i\) itself. Our hypotheses test are:

\[ H_{i_0}: \text{There is no spatial autocorrelation at location } i. \]

\[ H_{i_{a_1}}: \text{There is positive spatial autocorrelation at location } i. \]

\[ H_{i_{a_2}}: \text{There is negative spatial autocorrelation at location } i. \]

The interpretations are equal to those given for the Moran’s I value, but constrained to a location \(i.\)


OLS REGRESSION


OLS REGRESSION: EXPLANATION & ASSUMPTIONS


Multiple regression analysis is a statistical technique that inputs k independent variables to identify the unique surface in k+1 that minimizes the sum of squared residuals from our dependent variable observations. A residual is the shortest distance in k+1 dimensions from the best fit surface to the true observed data point in the dependent variable. Given that all assumptions for a Multiple Linear Regression are fulfilled, the core idea is that given our historic data on our k independent variables, the best fit surface will be a reliable and robust predictor for future values of the dependent variable. In standard statistical literature terminology, we say that we have a good regression model if our k independent variables ‘explain’ much of the variation, that is, if our best fit surface is close to the observation of the dependent variable for each observation.

In order to use Multiple Regression Analysis, we have to ensure that all assumptions are met:

  1. Linearity

  2. Normality of residuals

  3. Homoscedasticity of residuals

  4. No multicolinearity

  5. Continuity of dependent variable

  6. Residuals are random

  7. Assumption of independence

For a detailed explanation of OLS regression and its underlying assumptions, please refer to our previous assignment. In the results section of the aforementioned paper, we observed a common occurrence wherein data with spatial components exhibited clustering in the residuals. Notably, we consistently underestimated high logged median household values while overestimating low household values. Visualizing the residuals of the OLS regression often helps identify data with such spatial components. Below, a map of standardized residuals illustrating the clustering phenomenon:

The presence of clustered standardized errors signals the existence of an unaccounted spatial process. Simply put, the seemingly random noise conceals pertinent information crucial for evaluating the efficacy of our Ordinary Least Squares (OLS) regression. Consequently, the assumptions of homoscedasticity and independence of residuals are compromised, thereby violating key tenets of our regression analysis. Furthermore, the significance values associated with the \(\beta\) coefficients may be artificially inflated. An alternative method for detecting spatial autocorrelation involves regressing the residuals of the OLS model against the lagged neighboring residuals. This technique, clarified in the subsequent subsection, yields \(\beta\) coefficients termed as “b slope” for the lagged residuals.


SPATIAL LAG & SPATIAL ERROR REGRESSION


Several methods exist to address a spatial process in our dependent variable \(X.\) Initially, we will introduce spatial lag and spatial error, followed by an exploration of geographically weighted regression. All spatial lag and spatial error models will be executed using the R software.


MOTIVATION


A spatially lagged variable is the average of that value at those aerial units identified as neighbors by the contiguity weight matrix \(W.\) Applying this to the previous section, if we were to see a significant relationship between our residuals and lagged residuals, the assumption of independence of residuals would be violated. This also implies that \(\beta\) coefficients and their significance values may be inaccurate as they are inflated, which was discussed in the previous section. The goal of both spatial lag and spatial error regression is to remove these spatial dependencies in the residuals and to induce less heteroscedasticity.


ASSUMPTIONS & FORUMLAE


Both types of regressions, along with all OLS regression assumptions except for the independence of residuals, posit that the dependent variable in one location is linked to the values of that variable in nearby locations. We retain the assumption of independence of residuals, as it represents a factor we seek to address. Both models look to remedy this by incorporating an additional predictor.

We have given two tests for identify spatial dependencies, one visual and one that regresses residuals on lagged residuals. The reader might now ask himself if there are any fixed cut-off values after which one can be sure to use spatial lag or spatial error models. Whereas such a value doesn’t exist, we usually recommend running both OLS regression and spatial lag or spatial error regression if in doubt and compare the results on a number of criteria. We will discuss this in more detail in section “Spatial Lag vs OLS Regression” and “Spatial Error vs OLS Regression.”


SPATIAL LAG REGRESSION


Spatial lag ideally corrects the OLS model by including the lagged variable as an “independent variable” into the regression:

\[ y = \rho Wy + \beta_0 + \beta_1X_1 + \ldots + \beta_nX_n + \epsilon \] where \(\rho\) is the lag error parameter of the \(y-\)lag variable \(Wy.\) As such, \(\rho\) can take values between -1 and 1. As \(y\) is the output of our first OLS regression, spatial lag can be seen as a two-step regression model. Note: the new summand in the formula is a new predictor that ideally includes information of the spatial dependencies that previously were contained in the error term \(\epsilon.\) This, however, is not necessarily always the case. \(\rho\) can intuitively be interpreted as the coefficient to the \(y\)-lag variable \(Wy.\) The interpretation of \(\beta\) coefficients in the spatial lag model is complicated and not equivalent to that in spatial error and OLS regression. Therefore, we will just interpret the spatial lag depending on whether the value is positive or negative. Positive values indicate positive autocorrelation, and vice versa.


SPATIAL LAG VS OLS REGRESSION


In case our added corrective variable \(W_name\) is significant with \(p < 0.05,\) we can say that our dependent variable has a special process, as it demonstrates a significant correlation with adjacent values of the independent variable. It is essential to note that the inclusion of this corrective variable not only tends to alter parameters such as the intercept or coefficients but also affects the significance levels associated with each. In examining heteroscedasticity, which is linked to the independence of errors, we employ three testing methods: the White Test, the Breusch-Pagan Test, and the Koenker-Bassett Test. These tests help assess the presence of heteroscedasticity in our model, providing valuable insights into the variance structure of the residuals.

  1. White Test: This test serves as both a heteroscedasticity-consistent covariance matrix estimator and an evaluator of heteroscedasticity, without necessitating a predefined form for the alternative hypothesis regarding changing variances. In contrast to the Breusch-Pagan test, the White Test exhibits a broader applicability, making it a versatile tool for detecting and addressing heteroscedasticity in regression models. Its flexibility allows for a more comprehensive assessment of variance structures, contributing to the robustness of our statistical analysis.

  2. Breusch-Pagan Test: For the assessment of both heteroscedasticity and random coefficient variation, this test derives its test statistic by computing the squared residuals from the OLS regression model. In this process, the test scrutinizes the variance of the residuals by contrasting it with the variance predicted by a supplementary regression model. This auxiliary model utilizes the same independent variables employed in the original model to forecast the squared residuals. By comparing these variances, the test provides valuable insights into the presence and nature of heteroscedasticity and random coefficient variation within the regression framework, enhancing our understanding of the model’s performance.

  3. Koenker-Basset Test: A modified version of the Breusch-Pagan test, this variant considers the potential impact of individual data points on the assessment of heteroscedasticity. Notably, a high value for this statistic not only signals the presence of heteroscedasticity but also suggests a correlation between the variance of the residuals and the magnitude of the predicted values. This nuanced perspective enhances the diagnostic capability of the test, providing a more comprehensive understanding of the interplay between individual data points and the variability in the model’s predictions.


AKAIKE INFORMATION CRITERION (AIC) / SCHWARTZ CRITERION (SC)


AIC and SC are measures of the goodness of fit of an estimated statistical model. These measures act as relative indicators, quantifying the information lost when employing a particular model to represent reality. Essentially, they capture the delicate balance between the precision and complexity of the model, providing valuable insights into the model’s efficacy in capturing essential patterns while avoiding overfitting. The AIC is given by:

\[ \text{AIC} = -2\log (p(y \mid \omega_{ML}))+2K \] where y is some event and \(\omega_{ML}\) is the value that maximizes the likelihood function for \(y.\) Then \(p(y \mid \omega_{ML})\) is the maximized value of the likelihood function for the model. It becomes evident that when comparing models, more negative models have a better goodness of fit due to their greater maximized value of their likelihood function. The SC is related to the AIC but the AIC is usually preferred to predictive models whereas the SC is used for more parsimonious models. This is because the AIC favors models that that most adequately describe an unknown while the SC tries to find a true model in a set of candidates. Therefore, AIC and SC allow us to compare spatial lag with spatial error.


LOG LIKELIHOOD


The log likelihood method uses the maximum likelihood method of determining the best fitting data given the parameters of a statistical model, where a higher value indicates a better fit. We are only able to use log likelihood to compare nested models – the OLS regression model is a nested version of the spatial lag and spatial error regression models that does not include the spatial lag or spatial error terms. We can compare our OLS model to each spatial regression separately, but we are unable to compare the spatial regressions to each other using the log likelihood criteria.


LIKELIHOOD RATIO TEST (LRT)


The LRT assesses the goodness of fit between two nested models:

\[ H_0: \text{Log-Likelihood Standard OLS Reg. - Log-Likelihood Spatial Lag Reg. = 0.} \\ H_A: \text{Log-Likelihood Standard OLS Reg. - Log-Likelihood Spatial Lag Reg.} \not= 0. \] Models are nested if one two models have the same dependent variable but one of the model has the same and potentially more independent variables. As we see in the formula for spatial lag regression, we see that OLS regression is nested within spatial lag regression where \(\rho = 0.\) The likelihood function of a model is the probability that we actually observe the values of dependent variable \(X\) given our \(\beta\)-coefficients. The log-likelihood function is the logarithm of the likelihood function for a given model. Provided this information above, the hypotheses tests start to make sense since in case we reject \(H_0,\) this would imply that one set of coefficients is significantly better than the other in predicting the observed values of \(X.\)

Since adding the spatial process usually significantly increases the accuracy of a model (given that there is spatial autocorrelation), we usually reject \(H_0\) for \(p < 0.05\) and state that the spatial lag is a more predictive model than the standard OLS model. LRT can be seen an evaluation that assesses whether our additional variable actually improved the model in terms of accuracy or not. With regards to the spatial lag regression, the LRT is hence a test whether our new “predictor” \(\rho Wy\) absorbed some effect of the spatial process in the error term of the normal OLS regression. Note: LRT penalizes models that are more complex (e.g. have more independent variables) but yield the same goodness of fit as a nested model with fewer independent variables. Furthermore, we cannot compare spatial lag and spatial error regression with LRT since these models are not nested.


SPATIAL ERROR REGRESSION


Spatial error is a two-step regression model that ideally corrects the OLS model by splitting the error term \(\epsilon\) into random noise and the spatially lagged residual:

\[ y = \beta_0 + \beta_1X_1 + \ldots + \beta_nX_n + \lambda W \epsilon + \mu \] where \(\mu\) is the random noise, \(\lambda W \epsilon\) is the spatially lagged residual with the lag error parameter \(\lambda \in [-1, 1],\) and \(\epsilon = \lambda W \epsilon + \mu.\) We get \(\epsilon\) by regressing residuals on the residuals of neighbors as defined by \(W.\) This is the second step in the two-step regression process and in the best case extracts the spatial dependencies into the spatially lagged residual \(\lambda W \epsilon.\) Note that \(\lambda W \epsilon\) now is a predictor in our model. \(\lambda\) can intuitively be interpreted as the coefficient of the spatially lagged residual \(W \epsilon.\)

Our respective beta coefficients in the above formula are the solutions to the optimization problem in the multiple regression equation presented in our first analysis. There is exactly one beta coefficient for each independent variable. Beta coefficients are unit-free and the sign of the beta coefficient indicates whether relationship between the dependent and independent variable is positive or negative in the context of our regression model. Intuitively, a beta coefficients tell us by how much the value the dependent variable increases when the predictor \(i\) goes up by one unit and all other predictors are kept equal. As such, the beta coefficient can be thought of a strength relationship between dependent and independent variable in the context of our regression model. The \(\beta\) coefficients are statistically significant if the values fall below the p-threshold. Intuitively, the above also holds for the coefficient \(\lambda.\)


SPATIAL ERROR VS. OLS REGRESSION


Everything that we explained for Spatial Lag vs OLS Regression also holds for spatial error regression. Again, OLS regression is nested within spatial error regression with \(\lambda = 0.\)


CHOOSING BETWEEN SPATIAL LAG & ERROR REGRESSION


The choice of whether or not to use a spatial regression model depends on the outcome of the Lagrange Multiplier (LM) test. If neither the LM test for spatial lag nor the LM test for spatial error is significant, then OLS regression is sufficient, and no spatial regression model is needed. However, if one or both of the LM tests are significant, then a spatial regression model should be used.

If only one of the LM tests is significant, then the spatial regression model type corresponding to that significant test should be used. For example, if the LM test for spatial lag is significant but the LM test for spatial error is not, then a spatial lag model should be used.

If both the LM test for spatial lag and the LM test for spatial error are significant, then the decision of which spatial regression model type to use is based on the Robust LM values and probabilities. The spatial regression model type with the lower p-value or higher test statistic should be chosen.


GEOGRAPHICALLY WEIGHTED REGRESSION


For spatial lag and spatial error regressions as well as for the OLS regression, we have one consequential constraint - all of our predictors are globally stationary given that regardless of the location \(i,\) we apply the same coefficients to estimate a value for our dependent variable \(X.\) The idea that the modeled relations are constant throughout space (e.g. the study area) is called spatial stationarity, which does not always hold. One likely consequence of spatial stationarity could be spatial clustering in the residuals as we saw in the previous paper, assuming that close neighborhood effects are a more accurate and significant predictor of nearby values. This notion is also consistent with Tobler’s First Law of Geography.

A good way to illustrate the above is Simpson’s Paradox. Simpson’s Paradox is a phenomenon in statistics in which a trend appears in several groups but disappears or reverses when aggregating all groups into one. The reason this can happen is because even though some parts of the data might have a very clear trend, the aggregation process can even out many of these effects or even reverse them. This is something that likely happens to spatially constant coefficients and might ultimately lead to spatial clustering. One way to avoid these issues altogether is using local regression. Local regression runs a OLS regression for each location \(i\) and points in the data set that are considered “spatially close”. We will later see that we define this notion of spatial closeness through bandwidths. In GWR, data from locations farther away are given less weight, thereby increasing the effect of direct neighbors on the regression estimates for location \(i.\) Therefore, location regression is specific for a location \(i\) and employs the idea behind Tobler’s First Law of Geography.


GWR FORMULA & ASSUMPTIONS


The equation for GWR is:

\[ y_i = \sum_{k = 1}^{m} \beta_{ik}x_{ik}+ \epsilon_i \] where \(\beta_{ik}\) is the \(i\)-th location specific \(\beta\)-coefficient, \(x_{ik}\) is a data point that is considered spatially close to \(i,\) and \(\epsilon_i\) is the \(i\)-th location specific residual. GWR has the same assumptions has spatial lag and spatial error. Ideally, the data set includes at least 300 observations.


FIXED BANDWIDTH & ADAPTIVE BANDWIDTH


In local regression, the concept of “spatially close” is crucial for determining the weights assigned to neighboring observations when estimating the value at a specific location \(i\). This definition of spatial proximity is typically represented by a bandwidth, which defines the maximum distance or the number of neighboring observations considered when fitting the local regression model.

Two primary approaches exist for defining the bandwidth: fixed and adaptive. A fixed bandwidth specifies a predetermined distance threshold for all locations. Observations falling within this distance are considered neighbors and contribute to the local regression at the target location. Conversely, an adaptive bandwidth dynamically adjusts the distance threshold based on the local density of observations. This approach ensures that a consistent number of neighbors are considered for each location, regardless of the overall spatial distribution of data points.

The choice of bandwidth type, whether fixed or adaptive, can significantly impact the local regression outcome. A fixed bandwidth may lead to overfitting or underfitting in areas with sparse or dense data points, respectively. On the other hand, an adaptive bandwidth can adapt to varying data densities, potentially providing more consistent results across the study area.

In addition to influencing the spatial extent of neighboring observations, the bandwidth also determines the weighting function used in local regression. The weighting function assigns decreasing weights to observations as their distance from the target location increases. The specific form of the weighting function depends on the chosen bandwidth type and the desired smoothness of the local regression estimate.

Ultimately, the selection of an appropriate bandwidth is a crucial aspect of local regression modeling. The choice between fixed and adaptive bandwidths depends on the specific characteristics of the data and the research question being addressed. Careful consideration of the bandwidth’s impact on both the spatial extent of neighbors and the weighting function is essential for obtaining reliable and interpretable local regression results.

The fixed bandwidth technique has a Gaussian distribution, and one way to define it would be

\[ w_{ij} = e^{-0.5(\frac{\text{distance}_{ij}}{h})^2}. \] where \(h\) is the bandwidth distance and \(\text{distance}_{ij}\) is the distance between location \(i,j.\) A Gaussian distribution makes sense here since we don’t discriminate locations by the number of data points in the neighborhood of a point and therefore apply the same weighting distribution to each location. For a variable bandwith, the weighting function should be such that its distribution attributes higher weights than the Gaussian distribution to close points but \(w_{ij}\) rapidly decreases as distance increases. This is important since we do not know in advance the number of close data points we can capture and therefore have to give disproportionately high weights to close neighbors. If we do not do this, we would allow potentially very far points to have disproportionately high impact on our prediction.


GWR LIMITATIONS


One of the big limitations of GWR is that since we work with subsets of the dependent variable \(X,\) we are more likely to encounter local multicolinearity for our local regressions. Note that local multicolinearity does not imply global multicolinearity. Also, smaller sample sizes make it more difficult to detect a significant relationship between explanatory variables an a dependent variable. Additionally, explanatory variables in local regression are at risk insignificance since they are likely to have a very uniform distribution for the area within a bandwidth. We also have the multiple test problem in that we would need extremely many tests to determine significance of coefficients. With the multiple test problem also comes the issue of that we will have many Type I errors, even with low thresholds. The multiple test problem is also why we don’t have p-values associated to GWR.


RESULTS


HISTOGRAMS / LOG TRANSFORMATIONS


The provided histograms illustrate the distribution of all raw variables: MEDHVAL, PCBACHMORE, PCTVACANT, PCTSINGLES, NBELPOV100, and MEDHHINC. It is crucial to observe that variables such as PCTBACHMOR, PCTVACANT, PCTSINGLES, and NBELPOV100 exhibit a skewed distribution - indicating the need for log transformation before conducting additional analyses. The skewed distribution of these variables may impact the assumptions of normality in statistical analyses. Log transformation is a common practice to address skewness, helping to stabilize variances and improve the adherence of the data to normal distribution assumptions. Additionally, this transformation aids in mitigating the influence of extreme values, contributing to more robust and reliable analyses. It is also essential to ensure that the data conforms to the assumptions of the chosen statistical methods, as deviations from normality can affect the accuracy of parameter estimates and hypothesis testing. Therefore, applying log transformation to these variables is an important step in preparing the data for rigorous and valid statistical analysis. Note: certain regression techniques yield more valid results when applied to normally distributed data. Therefore, transforming these variables using a logarithmic scale is recommended to enhance the accuracy and reliability of subsequent analytical procedures.

par(oma=c(0,0,2,0)) 
par(mfrow=c(2,3)) 
hist(shp$PCTBACHMOR, breaks = 40, col = "#FF5500",
     main = "% of Pop with
     a Bachelor Degree or Higher",
     xlab = "PCTBACHMOR",
     border = "white",
     lwd = 0.1)
hist(shp$MEDHHINC, breaks = 50,col = "#FF5500",
     main = "Median Household Income",
     xlab = "MEDHHINC",
     border = "white",
     lwd = 0.1)
hist(shp$PCTVACANT, breaks = 40,col = "#FF5500",
     main = "% Units Marked as Vacant",
     xlab = "PCTVACANT",
    border = "white",
     lwd = 0.1)
hist(shp$PCTSINGLES, breaks = 40, col = "#FF5500",
     main = "% of Units Marked 
     as Single Family",
     xlab = "PCTSINGLES",
     border = "white",
     lwd = 0.1)
hist(shp$MEDHVAL, breaks = 50,col = "#FF5500",
     main = "Median House Value",
     xlab = "MEDHVAL",
     border = "white",
     lwd = 0.1)
hist(shp$NBelPov100, breaks = 50,col = "#FF5500",
     main = "# of Households Below Poverty",
     xlab = "NBelPov100",
     border = "white",
     lwd = 0.1)


HISTOGRAMS / LOG TRANSFORMED VARIABLES


Before proceeding with the transformation, it is crucial to identify variables with a minimum value of 0. Specifically, PCTBACHMOR, PCTVACANT, PCTSINGLES, and NBELPOV100 fall into this category. To ensure the accuracy of the transformation process, it is essential to add 1 to these variables that reach the minimum value of 0 before applying any further transformations. Hence, PCTBACHMOR, PCTVACANT, PCTSINGLES, and NBELPOV100, all with minimum values of 0, must be increased by 1 prior to undergoing the transformation process. This precautionary step is critical for preventing undefined results and ensuring the integrity of subsequent analytical procedures. The minimum values include PCTBACHMOR: 0, MEDHHINC: 2499, PCTVACANT: 0, PCTSINGLES: 0, MEDHVAL: 10000, BBELPOV100:0. The adjustment of variables with a minimum value of 0 by adding 1 is a standard practice in data preprocessing.

shp$Log_MEDHHINC <- log(shp$MEDHHINC)
shp$Log_MEDHVAL <- log(shp$MEDHVAL)

#log transformation +1
shp$Log_PCTBACHMOR <- log(shp$PCTBACHMOR + 1)
shp$Log_PCTVACANT <- log(shp$PCTVACANT + 1)
shp$Log_PCTSINGLES <- log(shp$PCTSINGLES + 1)
shp$Log_NBelPov100<- log(shp$NBelPov100 + 1)

par(oma=c(0,0,2,0)) 
par(mfrow=c(2,3)) 
hist(shp$Log_PCTBACHMOR, breaks = 40, col = "#FF5500",
     main = "Log Trans. of % Pop 
     with Bach. Degree or Higher",
     xlab = "LNPCTBACHMOR",
     border = "white",
     lwd = 0.1)
hist(shp$Log_MEDHHINC, breaks = 50,col = "#FF5500",
     main = "Log Trans. of 
     Median Household Income",
     xlab = "LNMEDHHINC",
     border = "white",
     lwd = 0.1)
hist(shp$Log_PCTVACANT, breaks = 40,col = "#FF5500",
     main = "Log Trans. of 
     % Units marked as Vacant",
     xlab = "LNPCTVACANT",
     border = "white",
     lwd = 0.1)
hist(shp$Log_PCTSINGLES, breaks = 40, col = "#FF5500",
     main = "Log Trans. of % Units
     Marked as Single Family",
     xlab = "LNPCTSINGLES",
     border = "white",
     lwd = 0.1)
hist(shp$Log_MEDHVAL, breaks = 50,col = "#FF5500",
     main = "Log Trans. of the
     Median House Value",
     xlab = "LNMEDHVAL",
     border = "white",
     lwd = 0.1)
hist(shp$Log_NBelPov100, breaks = 50,col = "#FF5500",
     main = "Log Trans. of the
     # of HH Below Povery",
     xlab = "LNNBelPov100",
     border = "white",
     lwd = 0.1)


DEFINING NEIGHBOR RELATIONSHIPS


The premise of spatial autocorrelation centers around Tobler’s First Law of Geography. Articulated by Waldo Tobler in 1969, the law posits “everything is related to everything else, but near things are more related than distant things.” It suggests objects/phenomena that are geographically close to one another are more likely to be similar, or have a spatial relationship when compared objects that are farther apart. The first law of geography is the foundation of the fundamental concepts of spatial dependence and spatial autocorrelation - utilized specifically for the inverse distance weighting method of spatial interpolation and supports the regionalized variable theory for kriging. Ultimately, Tobler’s law is a fundamental assumption applied in all spatial analysis.

Therefore, we begin by defining neighbors for each of the block groups in Philadelphia. A polygon (in this case, block group) is considered to have a “Queen Neighbor Relationship” to another polygon if they share either a side and/or a vertex. This is compared to a “Rook Neighbor Relationship,” where two block groups only share a side. Neighbors based on contiguity are constructed by assuming that neighbors of a given area are other areas that share a common boundary. Neighbors can be of type Queen if a single shared boundary point meets the contiguity condition, or Rook if more than one shared point is required to meet the contiguity condition. We’ll prepare a weight matrix based on Queen neighbors, and visualize the connections between neighbors based on the weight matrix.


# Producing Queen Neighbor Relationship Outputs

queen<-poly2nb(shp, row.names=shp$POLY_ID)

summary(queen)
## Neighbour list object:
## Number of regions: 1720 
## Number of nonzero links: 10526 
## Percentage nonzero weights: 0.3558004 
## Average number of links: 6.119767 
## Link number distribution:
## 
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  18  27 
##   4  16  52 175 348 493 344 177  62  28  10   4   2   1   1   1   1   1 
## 4 least connected regions:
## 441 708 1391 1665 with 1 link
## 1 most connected region:
## 1636 with 27 links
# Visualize Connections Between Neighbors Based On Weight Matrix

reg_data_lines = nb2lines(queen, 
                          coords = st_centroid(shp$geometry), 
                          as_sf = TRUE)

tm_shape(shp) + 
  tm_borders(col = 'grey80', lwd = 0.5) + 
tm_shape(shp) +
  tm_dots() +
tm_shape(reg_data_lines) +
  tm_lines(col = '#FF5800') +
tm_layout(frame = FALSE)

# Visualizing Which Block Groups Have Only One Neighbor

smallestnbcard<-card(queen) 
smallestnb<-which(smallestnbcard == min(smallestnbcard))
fg<-rep('grey95', length(smallestnbcard))
fg[smallestnb]<-'#FF5800'
fg[queen[[smallestnb[1]]]]<-"#E6872D"
fg[queen[[smallestnb[2]]]]<-"#E6872D"
fg[queen[[smallestnb[3]]]]<-"#E6872D"
fg[queen[[smallestnb[4]]]]<-"#E6872D"
plot(shp$geometry, col=fg, lwd=0.5, border='grey80')
title(main='BLOCK GROUPS WITH ONLY 01 NEIGHBOR')

# Visualizing Which Block Groups Have The Most Neighbors

largestnbcard<-card(queen)
largestnb<-which(largestnbcard == max(largestnbcard))
fg1<-rep('grey95', length(largestnbcard))
fg1[largestnb]<-'#FF5800'
fg1[queen[[largestnb]]]<-"#E6872D"
plot(shp$geometry, col=fg1, lwd=0.5, border='grey80')
title(main='BLOCK GROUPS WITH 27 NEIGHBORS')


GLOBAL MORAN’S I


A calculated global Moran’s I value of 0.79 for the dependent variable LNMEDHVAL indicates a high degree of spatial autocorrelation. This is further corroborated by comparing this value against the Moran’s I values derived from 999 random permutations. The histogram clearly demonstrates that the value of Moran’s I for LNMEDHVAL of 0.79 falls significantly outside the range of values obtained from the 999 random permutations. Therefore, we can confidently conclude that LNMEDHVAL exhibits substantial spatial autocorrelation, compelling us to reject the null hypothesis of no spatial autocorrelation.

# Calculate Global Moran's I: Measure Of Spatial Autocorrelation

queenlist<-nb2listw(queen, style = 'W')
moran(shp$Log_MEDHVAL, 
      queenlist, 
      n=length(queenlist$neighbours),
      S0=Szero(queenlist))$`I`
## [1] 0.7935638
# Reshuffling Test With 999 Permutations + Present Moran's I Value

moranMC<-moran.mc(shp$Log_MEDHHINC, queenlist, nsim=999, alternative="two.sided")

moranMC
## 
##  Monte-Carlo simulation of Moran I
## 
## data:  shp$Log_MEDHHINC 
## weights: queenlist  
## number of simulations + 1: 1000 
## 
## statistic = 0.55246, observed rank = 1000, p-value <
## 0.00000000000000022
## alternative hypothesis: two.sided
moranMCres <- moranMC$res
hist(
  moranMCres,
  breaks = 100,
  freq = 10000000,
  main = "HISTOGRAM OF MORAN'S I",
  xlab = "MORAN'S I VALUES", 
  ylab = "FREQUENCY",
  col = '#FF5800',  
  border = "white")

abline(v=moran(shp$Log_MEDHHINC, queenlist, n=length(queenlist$neighbours), S0=Szero(queenlist))$`I`, col='#FF5800')

To further substantiate the significance of our Moran’s I value, we examined the correlation between the natural logarithm of median housing values (LNMEDHHINC) in block groups and their neighboring counterparts. As evident in the plot below, a discernible pattern emerged, reinforcing the notion of a meaningful relationship between block group observations and those of their neighbors. The absence of such a relationship would have resulted in an indistinct pattern, contrary to our findings. The observed clear patterns signifying a noteworthy spatial association between LNMEDHHINC values, underscoring the validity of Moran’s I as a measure for evaluating spatial autocorrelation within our dataset.

moran.plot(shp$Log_MEDHHINC, queenlist, xlab = "LNMEDHHINC", ylab = "SPATIAL LAG") 


LOCAL MORAN’S I


The following maps unveil unique patterns in the spatial distribution of the variable LNMEDHVAL. When the local Moran’s I p-value falls below 0.05, significance is attributed to spatial clusters. The Local Indicator of Spatial Association (LISA) is a helpful tool to produce a P-value that can be interpreted alongside a spatial association cluster map. The following results show the spatial relationship of Median House Value (MEDHVAL) across block groups while the association map’s values are as follows:

  • High-high: The polygon and its neighbors contain particularly high observed values.
  • High-low: The polygon has a high observed value, though its neighbors are significantly lower in value.
  • Low-high: The polygon has a low observed value, though its neighbors are significantly lower in value.
  • Low-low: The polygon and its neighbors contain particularly high observed values.

An examination of the spatial distribution of Median House Value (MEDHVAL) in Philadelphia reveals patterns and notable clusters. Upon initial observation, significant high-high clusters emerge in Center City, Society Hill, Rittenhouse Square, Roxborough, and sections of Northeast Philadelphia. These areas are characterized by elevated median house values, surrounded by similarly high values, indicating spatial concentrations of affluence. In stark contrast, noteworthy low-low clusters are prevalent in North Philadelphia, Kensington, and Southwest Philadelphia. These neighborhoods exhibit consistently low median house values, with neighboring areas also displaying similarly low values, suggesting spatial concentrations of lower socioeconomic status.

Interestingly, areas exhibiting high-low clustering encompass parts of East Falls and localized regions along the Delaware River in Northeast Philadelphia. These areas showcase a contrast of high median house values in certain pockets, interspersed with neighboring areas featuring low values, implying the presence of distinct socioeconomic disparities within these communities. Meanwhile, locales with low-high clustering include Navy Yard and Mill Creek. These areas exhibit low median house values, surrounded by neighboring areas with high values, suggesting the potential for gentrification or revitalization efforts in these communities.

Notable areas deemed not significant encompass a substantial portion of South Philadelphia, Kensington, and West Philadelphia. These areas lack identifiable spatial clustering or outliers, implying a relatively uniform spatial pattern of median house values. This suggests a more balanced socioeconomic distribution in these neighborhoods. Ultimately, the spatial distribution of MEDHVAL in Philadelphia highlights the diverse socioeconomic landscape of the city. The presence of high-high clusters underscores areas of affluence and concentrated wealth, while low-low clusters indicate zones of lower socioeconomic status. High-low and low-high clusters reveal pockets of socioeconomic disparity and potential gentrification trends. These patterns underscore the importance of targeted development and investment strategies to address socioeconomic disparities and promote equitable access to housing opportunities across Philadelphia.

tm_shape(shp) + 
  tm_polygons(title = "LN(MEDHVAL)", col = "LNMEDHVAL", border.col = "white", border.lwd = 0.5, palette = paletteV6, style = "jenks") + 
  tm_layout(main.title = "MEDIAN HOUSE VALUES",
            legend.position = c("right", "bottom"),
            frame = FALSE)

# Calculate Local Moran's I: Measure Of Spatial Autocorrelation

LISA<-localmoran(shp$Log_MEDHVAL, queenlist)

head(LISA)
##          Ii            E.Ii       Var.Ii     Z.Ii  Pr(z != E(Ii))
## 1 5.3518753 -0.003049782428 1.3051239240 4.687352 0.0000027676336
## 2 4.4121912 -0.002216570245 0.7590386437 5.066879 0.0000004043918
## 3 3.5006316 -0.003049782428 0.7444805191 4.060672 0.0000489316331
## 4 2.4444508 -0.000843876267 0.2410035985 4.981033 0.0000006324567
## 5 1.8834699 -0.001094165811 0.6259058436 2.382080 0.0172151443052
## 6 0.0995037 -0.000001608295 0.0009210143 3.278786 0.0010425472857
# Mapping The P-Values We Calculated

df.LISA <- cbind(shp, as.data.frame(LISA))

moranSig.plot <- function(df, listw, title) {
  local <- localmoran(x = df$Log_MEDHVAL, 
                      listw = listw, 
                      zero.policy = FALSE)
  moran.map <- cbind(df, local)
  tm <- tm_shape(moran.map) +
    tm_borders(col = 'white') +
    tm_fill(style = 'fixed',
            col = 'Pr.z....E.Ii..', 
            breaks = c(0, 0.001, 0.01, 0.05, 1), 
            title = 'P-VALUES', 
            palette = paletteV3) +
    tm_layout(frame = FALSE, 
              title = title)
  print(tm)
}

moranSig.plot(df.LISA, 
              queenlist, 
              'P-VALUES MEDHVAL')

# Creating A Spatial Association Map

hl.plot<-function(df, listw){
  local<-localmoran(x=df$Log_MEDHVAL, 
                    listw=listw, 
                    zero.policy = FALSE
                    )
  quadrant<-vector(mode='numeric', 
                   length=323
                   )
  m.prop<-df$Log_MEDHVAL - mean(df$Log_MEDHVAL)
  m.local<-local[,1]-mean(local[,1])
  signif<-0.05
  quadrant[m.prop >0 & m.local>0]<-4 #high MEDHHINC, high clustering
  quadrant[m.prop <0 & m.local<0]<-1 #low MEDHHINC, low clustering
  quadrant[m.prop <0 & m.local>0]<-2 #low MEDHINC, high clustering
  quadrant[m.prop >0 & m.local<0]<-3 #high MEDHHINC, low clustering
  quadrant[local[,5]>signif]<-0
  
  brks <- c(0,1,2,3,4)
  colors <- c("#FCF4EE","#CAE4F1","#21505A","#FFA500","#FF5500")
  plot<-plot(shp,
             border="gray90",
             bty = "n",
              lwd=0.25,
             col=colors[findInterval(quadrant,brks,all.inside=FALSE)])}

hl.plot(shp, queenlist)
legend("bottomright",
       legend=c("insignificant","low-high","low-low","high-low","high-high"),
       fill=c("#FCF4EE", "#CAE4F1", "#21505A", "#FFA500", "#FF5500"),
       bty="n", cex = .8)


OLS REGRESSION


Ordinary Least Squares (OLS) Regression serves as a fundamental method for exploring and understanding the intricate relationships between a set of independent variables and a dependent variable. In the context of the analysis at hand, the four independent variables — namely Log_NBelPov100, PCTBACHMOR, PCTSINGLES, and PCTVACANT — are under scrutiny in relation to the dependent variable, Log_MEDHVAL. The OLS Regression method involves constructing a model that endeavors to capture the underlying patterns and dependencies within the dataset. This process entails finding the line (or hyperplane in higher dimensions) that minimizes the sum of the squared differences between the observed values and the values predicted by the model. Consequently, the analysis involves not only the formulation of this model but also a thorough examination of its parameters, coefficients, and statistical significance.

The objective is to discern the extent to which each independent variable contributes to explaining the variance in the dependent variable. Through this analytical lens, OLS Regression provides valuable insights into the dynamics and significance of the relationships between the specified independent variables and the target variable, thereby aiding in the interpretation and understanding of the overall data patterns. The outcomes of the OLS regression analysis revealed that poverty levels, education status, home vacancy rates, and single-resident homes serve as statistically significant predictors for the natural logarithm of median home values in Philadelphia. The model achieved an 𝑅² value of 0.6623, indicating that it accounts for 66.23% of the variance in median home values.

# Regress The Log Of Median House Value On Four Predictors

reg<-lm(formula=Log_MEDHVAL ~ Log_NBelPov100 + PCTBACHMOR + PCTSINGLES + PCTVACANT, data=shp)

summary(reg)
## 
## Call:
## lm(formula = Log_MEDHVAL ~ Log_NBelPov100 + PCTBACHMOR + PCTSINGLES + 
##     PCTVACANT, data = shp)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.25825 -0.20391  0.03822  0.21744  2.24347 
## 
## Coefficients:
##                  Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)    11.1137661  0.0465330 238.836 < 0.0000000000000002 ***
## Log_NBelPov100 -0.0789054  0.0084569  -9.330 < 0.0000000000000002 ***
## PCTBACHMOR      0.0209098  0.0005432  38.494 < 0.0000000000000002 ***
## PCTSINGLES      0.0029769  0.0007032   4.234            0.0000242 ***
## PCTVACANT      -0.0191569  0.0009779 -19.590 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3665 on 1715 degrees of freedom
## Multiple R-squared:  0.6623, Adjusted R-squared:  0.6615 
## F-statistic: 840.9 on 4 and 1715 DF,  p-value: < 0.00000000000000022
standardised<-rstandard(reg)
resnb<-sapply(queen, function(x) mean(standardised[x]))

shp$standardised <- standardised
OLS.Residuals.Map<-tm_shape(shp)+
  tm_fill(col='standardised', 
          style='quantile', 
          palette = paletteV2, 
          midpoint = 0,
          #breaks = c(0, 1, 2, 4, 6 )
)  +
  tm_layout(frame=FALSE)

OLS.Residuals.Map

# Print The Natural Logarithm Likelihood # COME BACK TO!

logLik(reg)
## 'log Lik.' -711.5376 (df=6)


HETEROSCEDASTICITY TESTS


The following results for the three heteroscedasticity tests, Breusch-Pagan, Koenker-Bassett, and White’s tests, consistently indicate a strong rejection of the null hypothesis of homoscedasticity in the regression model. The p-values for all three tests are significantly below the common alpha level of 0.05, suggesting that the variance of the residuals is not constant across observations. The Koenker-Bassett test further corroborates this finding with a high-test statistic of 43.94, providing compelling evidence of heteroscedasticity. These findings collectively lead to the rejection of the null hypothesis, implying that the variance of the model’s residuals is not constant and instead exhibits heteroscedasticity.

# Testing Heteroscedasticity With The Breusch-Pagan Test

bptest(reg, studentize=FALSE)
## 
##  Breusch-Pagan test
## 
## data:  reg
## BP = 113.19, df = 4, p-value < 0.00000000000000022
# Testing Heteroscedasticity With The Koenker-Bassett Test

bptest(reg)
## 
##  studentized Breusch-Pagan test
## 
## data:  reg
## BP = 42.869, df = 4, p-value = 0.00000001102
# Testing Heteroscedasticity With The White Test

white_test(reg)
## White's test results
## 
## Null hypothesis: Homoskedasticity of the residuals
## Alternative hypothesis: Heteroskedasticity of the residuals
## Test Statistic: 43.94
## P-value: 0

We run the Jarque-Bera test to assess whether the residuals of our regression are normal. The Jarque-Bera test reveals that the assumption of normally distributed regression errors is not met. The p-value for the Jarque-Bera test is extremely low, falling well below the standard significance level (p < 0.05), indicating that our residuals are not normally distributed. This small p-value compels the rejection of the null hypothesis that the residuals adhere to a normal distribution. This finding contradicts the histogram of residuals presented in our first assingment on OLS regression. This discrepancy can be attributed to the limitations and inherent subjectivity of visual interpretations of histogram results.

# Testing Heteroscedasticity With The Jarque-Bera Test

jarque.bera.test(reg$residuals)
## 
##  Jarque Bera Test
## 
## data:  reg$residuals
## X-squared = 778.95, df = 2, p-value < 0.00000000000000022


MORAN’S I SIGNIFICANCE


The provided output shows the residuals of the Ordinary Least Squares (OLS) Model, with a notable feature being the standardization of these residuals achieved by dividing them by an estimate of their standard deviation. This standardized representation offers a clearer perspective on the magnitude of the residuals in relation to their expected variability. However, upon a careful examination of the results, a discernible pattern emerges, suggesting the presence of autocorrelation within the residuals. Autocorrelation, in this context, indicates a systematic correlation between the residuals at different points in the dataset, potentially indicating that the model has not fully captured some underlying temporal or sequential patterns present in the data. Identifying and addressing autocorrelation is crucial for refining the model’s accuracy and reliability, as it ensures that the residuals exhibit a random distribution, and any discernible patterns are a result of genuine relationships between the variables rather than systematic errors in the modeling process.

res.lm <- lm(formula=standardised ~ resnb)

summary(res.lm)
## 
## Call:
## lm(formula = standardised ~ resnb)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.3685 -0.4450  0.0585  0.4618  5.4434 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept) -0.01281    0.02121  -0.604               0.546    
## resnb        0.73235    0.03244  22.576 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8793 on 1718 degrees of freedom
## Multiple R-squared:  0.2288, Adjusted R-squared:  0.2283 
## F-statistic: 509.7 on 1 and 1718 DF,  p-value: < 0.00000000000000022

Regressing the standardized residuals against the spatially lagged residuals yields a slope coefficient (b) of approximately 0.73: a one-unit change in the lagged residual corresponds to a 0.73-unit change in the standardized residual. A slope departing significantly from zero indicates statistically significant spatial autocorrelation in the model’s residuals, and here the p-value associated with the slope approaches zero. The Monte Carlo test below tells the same story, with a Moran’s I of 0.31 whose observed rank exceeds all 999 permutations. We therefore infer a statistically significant positive correlation between the residuals and their spatial lag: spatial autocorrelation manifests in the OLS residuals.

moran.mc(standardised, queenlist, 999, alternative="two.sided")
## 
##  Monte-Carlo simulation of Moran I
## 
## data:  standardised 
## weights: queenlist  
## number of simulations + 1: 1000 
## 
## statistic = 0.3124, observed rank = 1000, p-value < 0.00000000000000022
## alternative hypothesis: two.sided
moran.plot(standardised, queenlist, main = "Moran Scatterplot",
           xlab = "STANDARDIZED RESIDUALS", ylab = "SPATIALLY LAGGED STANDARDIZED RESIDUALS")


SPATIAL LAG + SPATIAL ERROR REGRESSION


SPATIAL LAG REGRESSION OUTPUTS


The spatial lag regression model revealed distinct spatial dependence among median home values, with a spatial autoregressive coefficient \(\rho\) of 0.65 and a highly significant p-value on the spatial lag term. This implies that median home values in a given block group are strongly influenced by those in neighboring block groups. The independent variables, LNNBELPOV, PCTBACHMOR, PCTSINGLES, and PCTVACANT, remain significant predictors of median home values, consistent with the Ordinary Least Squares (OLS) model. Interestingly, the spatial lag model’s intercept is roughly one-third that of the OLS model; the coefficients for LNNBELPOV, PCTBACHMOR, and PCTSINGLES are slightly lower, while the coefficient for PCTVACANT is slightly higher. The Breusch-Pagan test for the spatial lag model yields a p-value approaching zero (well below 0.05), indicating that heteroscedasticity persists in the model’s residuals.
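For reference, the spatial lag specification estimated below can be written as

\[ y = \rho W y + X\beta + \varepsilon, \]

where \(W\) is the row-standardized queen-contiguity weights matrix, \(\rho\) measures the dependence of each block group’s log median home value on its neighbors’ values, and \(\varepsilon\) is the error term.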

To evaluate the relative fit of the OLS and spatial lag models, we compared their AIC/SC, log-likelihood, and likelihood ratio test values. The spatial lag model exhibits a markedly lower AIC (525) than the OLS model (1435), suggesting a superior fit on this metric. Similarly, the spatial lag model’s log-likelihood (−256) exceeds the OLS model’s (−711). Moreover, the likelihood ratio test produces a p-value close to zero, so we reject the null hypothesis that the spatial lag term adds no explanatory power beyond the OLS specification. On all of these criteria, the spatial lag model is the better fit for the data.
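The likelihood ratio statistic reported below can be verified directly from the two log-likelihoods:

\[ \mathrm{LR} = 2\left(\ell_{\text{lag}} - \ell_{\text{OLS}}\right) = 2\left(-255.78 - (-711.54)\right) \approx 911.5, \]

which is compared against a \(\chi^{2}\) distribution with one degree of freedom.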

lagreg<-lagsarlm(formula=Log_MEDHVAL ~ Log_NBelPov100 + PCTBACHMOR + PCTSINGLES + PCTVACANT, data=shp, queenlist)

summary(lagreg)
## 
## Call:lagsarlm(formula = Log_MEDHVAL ~ Log_NBelPov100 + PCTBACHMOR + 
##     PCTSINGLES + PCTVACANT, data = shp, listw = queenlist)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -1.655464 -0.117249  0.018654  0.133132  1.726460 
## 
## Type: lag 
## Coefficients: (asymptotic standard errors) 
##                   Estimate  Std. Error  z value              Pr(>|z|)
## (Intercept)     3.89844052  0.20111373  19.3843 < 0.00000000000000022
## Log_NBelPov100 -0.03405540  0.00629303  -5.4116         0.00000006246
## PCTBACHMOR      0.00851392  0.00052195  16.3119 < 0.00000000000000022
## PCTSINGLES      0.00203340  0.00051578   3.9424         0.00008068395
## PCTVACANT      -0.00852964  0.00074369 -11.4694 < 0.00000000000000022
## 
## Rho: 0.6511, LR test value: 911.51, p-value: < 0.000000000000000222
## Asymptotic standard error: 0.01805
##     z-value: 36.072, p-value: < 0.000000000000000222
## Wald statistic: 1301.2, p-value: < 0.000000000000000222
## 
## Log likelihood: -255.7844 for lag model
## ML residual variance (sigma squared): 0.071952, (sigma: 0.26824)
## Number of observations: 1720 
## Number of parameters estimated: 7 
## AIC: 525.57, (AIC for lm: 1435.1)
## LM test for residual autocorrelation
## test value: 67.739, p-value: 0.00000000000000022204
# "LAGREG" Is The SL Output While "REG" Is The OLS Output

LR.Sarlm(lagreg, reg)
## 
##  Likelihood ratio for spatial linear models
## 
## data:  
## Likelihood ratio = 911.51, df = 1, p-value < 0.00000000000000022
## sample estimates:
## Log likelihood of lagreg    Log likelihood of reg 
##                -255.7844                -711.5376
# Testing Heteroscedasticity With The Breusch-Pagan Test

bptest.Sarlm(lagreg, studentize=FALSE)
## 
##  Breusch-Pagan test
## 
## data:  
## BP = 210.76, df = 4, p-value < 0.00000000000000022
# Testing Heteroscedasticity With The Koenker-Bassett Test

bptest.Sarlm(lagreg)
## 
##  studentized Breusch-Pagan test
## 
## data:  
## BP = 51.412, df = 4, p-value = 0.0000000001831
# Using The Jarque-Bera Test To Assess Whether Residuals Are Normal

jarque.bera.test(lagreg$residuals)
## 
##  Jarque Bera Test
## 
## data:  lagreg$residuals
## X-squared = 2756.8, df = 2, p-value < 0.00000000000000022

The spatial lag model effectively addresses spatial autocorrelation in the data, as evidenced by its residual Moran’s I of −0.08 (p-value = 0.002), which indicates slight negative spatial autocorrelation in the residuals. This is a marked improvement over the OLS model’s residual Moran’s I of 0.31. The spatial lag model’s superior performance is further supported by its lower AIC, higher log-likelihood, statistically significant likelihood ratio test, and residual Moran’s I far closer to zero than that of the OLS model.

shp <- shp |>
  mutate(
    spatial_lag_residuals = residuals(lagreg))

tm_shape(shp) +
  tm_fill(col = 'spatial_lag_residuals', 
          style = 'quantile', 
          title = 'Spatial Lag Residuals', 
          palette = 'Oranges',
          midpoint = 0) +
  tm_layout(frame = FALSE)

# Reuse The Existing Queen Contiguity Weights For The Moran Scatterplot

moran.plot(shp$spatial_lag_residuals, queenlist,
           xlab = "Spatial Lag Residuals", 
           ylab = "Lagged Spatial Lag Residuals")

The output below reports a Monte Carlo simulation of Moran’s I, a permutation approach to assessing the significance of spatial autocorrelation. The observed statistic is compared against the distribution of Moran’s I values obtained from 999 random permutations of the residuals across block groups, and the rank of the observed value within that distribution yields a pseudo p-value. An observed statistic more extreme than all of the permuted values, as here, indicates autocorrelation beyond what random chance would produce, underscoring the need to account for spatial dependence when interpreting the model.

Consistent with the map and scatterplot above, the Monte Carlo test confirms that the spatial lag model has absorbed most of the spatial structure in the data, leaving only slight negative autocorrelation in the residuals (Moran’s I = −0.08, p-value = 0.002). The Breusch-Pagan test, however, reveals that heteroscedasticity persists in the spatial lag residuals.

reslag<-lagreg$residuals

lagMoranMc<-moran.mc(reslag, queenlist, 999, alternative="two.sided")

lagMoranMc
## 
##  Monte-Carlo simulation of Moran I
## 
## data:  reslag 
## weights: queenlist  
## number of simulations + 1: 1000 
## 
## statistic = -0.082412, observed rank = 1, p-value = 0.002
## alternative hypothesis: two.sided


SPATIAL ERROR REGRESSION OUTPUTS


The spatial error model outperforms the OLS model in explaining the variation in median home values. The significant spatial error term (lambda = 0.81, p-value near 0) indicates that the residuals of median home values are spatially correlated across neighboring areas. The model retains the significance of the independent variables (LNNBELPOV, PCTBACHMOR, PCTSINGLES, PCTVACANT) and exhibits improved goodness of fit, with a log-likelihood of −373 against the OLS model’s −711 (the summary reports the AIC as NA for this weighted model, though the OLS AIC of 1435 is shown for reference). The likelihood ratio test further supports the superiority of the spatial error model. Despite these advantages, the non-studentized Breusch-Pagan test indicates that heteroscedasticity persists in the spatial error residuals, although the studentized Koenker-Bassett version is no longer significant (p = 0.271).
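For reference, the spatial error specification estimated below can be written as

\[ y = X\beta + u, \qquad u = \lambda W u + \varepsilon, \]

which places the spatial autocorrelation in the disturbance term \(u\) rather than in the dependent variable, with \(\lambda\) measuring its strength.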

errreg<-errorsarlm(formula=Log_MEDHVAL ~ Log_NBelPov100 + PCTBACHMOR + PCTSINGLES + PCTVACANT, data=shp, queenlist)
reserr<-residuals(errreg)
errresnb<-sapply(queen, function(x) mean(reserr[x]))
summary(errreg)
## 
## Call:errorsarlm(formula = Log_MEDHVAL ~ Log_NBelPov100 + PCTBACHMOR + 
##     PCTSINGLES + PCTVACANT, data = shp, listw = queenlist)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.92652 -0.11541  0.01489  0.13386  1.94867 
## 
## Type: error 
## Coefficients: (asymptotic standard errors) 
##                   Estimate  Std. Error  z value              Pr(>|z|)
## (Intercept)    10.90641386  0.05346907 203.9761 < 0.00000000000000022
## Log_NBelPov100 -0.03453480  0.00708952  -4.8713      0.00000110894233
## PCTBACHMOR      0.00981310  0.00072898  13.4614 < 0.00000000000000022
## PCTSINGLES      0.00267790  0.00062085   4.3133      0.00001608343393
## PCTVACANT      -0.00578324  0.00088672  -6.5220      0.00000000006936
## 
## Lambda: 0.81492, LR test value: 677.61, p-value: < 0.000000000000000222
## Asymptotic standard error: 0.016373
##     z-value: 49.772, p-value: < 0.000000000000000222
## Wald statistic: 2477.2, p-value: < 0.000000000000000222
## 
## Log likelihood: -372.7338 for error model
## ML residual variance (sigma squared): 0.076555, (sigma: 0.27669)
## Number of observations: 1720 
## Number of parameters estimated: 7 
## AIC: NA (not available for weighted model), (AIC for lm: 1435.1)
# "ERRREG" Is The SE Output While "REG" Is The OLS Output

LR.Sarlm(errreg, reg)
## 
##  Likelihood ratio for spatial linear models
## 
## data:  
## Likelihood ratio = 677.61, df = 1, p-value < 0.00000000000000022
## sample estimates:
## Log likelihood of errreg    Log likelihood of reg 
##                -372.7338                -711.5376
# Testing Heteroscedasticity With The Breusch-Pagan Test

bptest.Sarlm(errreg, studentize=FALSE)
## 
##  Breusch-Pagan test
## 
## data:  
## BP = 23.213, df = 4, p-value = 0.0001148
# Testing Heteroscedasticity With The Koenker-Bassett Test

bptest.Sarlm(errreg)
## 
##  studentized Breusch-Pagan test
## 
## data:  
## BP = 5.1627, df = 4, p-value = 0.271
# Using The Jarque-Bera Test To Assess Whether Residuals Are Normal

jarque.bera.test(errreg$residuals)
## 
##  Jarque Bera Test
## 
## data:  errreg$residuals
## X-squared = 3506.9, df = 2, p-value < 0.00000000000000022

The Moran’s I of −0.09 for the spatial error residuals indicates slight negative spatial autocorrelation, statistically significant at p < 0.05. Compared with the OLS model’s residual Moran’s I of 0.31, the spatial error model performs far better, as evidenced by its higher log-likelihood, statistically significant likelihood ratio test, and residual Moran’s I much closer to zero. The non-studentized Breusch-Pagan test, however, indicates that some heteroscedasticity remains in the model’s residuals, warranting further investigation.

shp <- shp |>
  mutate(
    spatial_error_residuals = reserr)

tm_shape(shp) +
  tm_fill(col = 'spatial_error_residuals', 
          style = 'quantile', 
          title = 'Spatial Error Residuals', 
          palette = 'Oranges',
          midpoint = 0) +
  tm_layout(frame = FALSE)

# Reuse The Existing Queen Contiguity Weights For The Moran Scatterplot

moran.plot(shp$spatial_error_residuals, queenlist,
           xlab = "Spatial Error Residuals", 
           ylab = "Lagged Spatial Error Residuals")

errMoranMc<-moran.mc(reserr, queenlist, 999, alternative="two.sided")

errMoranMc
## 
##  Monte-Carlo simulation of Moran I
## 
## data:  reserr 
## weights: queenlist  
## number of simulations + 1: 1000 
## 
## statistic = -0.094532, observed rank = 1, p-value = 0.002
## alternative hypothesis: two.sided


GEOGRAPHICALLY WEIGHTED REGRESSION


Geographically Weighted Regression (GWR) serves as a valuable method for exploring the intricate relationships between variables while preserving sensitivity to spatial patterns. The pivotal factor in GWR lies in the concept of “bandwidth,” denoting the distance or range over which neighboring data points influence the regression model. There are two primary types of bandwidth: Adaptive and Fixed.
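Formally, GWR lets the coefficients vary with the coordinates \((u_i, v_i)\) of each observation, fitting a weighted regression at every location:

\[ y_i = \beta_0(u_i, v_i) + \sum_{k} \beta_k(u_i, v_i)\, x_{ik} + \varepsilon_i \]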

Adaptive bandwidth proves to be the optimal choice when dealing with sporadically distributed points, such as block groups exhibiting irregular patterns across the geographical landscape. On the other hand, fixed bandwidth becomes more advantageous when observations exhibit a greater degree of spatial uniformity.

To enhance the precision of GWR, the analysis calculates optimal adaptive and fixed bandwidths by minimizing the Akaike Information Criterion, with observations weighted by a Gaussian kernel. Tuning the bandwidth in this way aligns the GWR model with the underlying spatial structure of the data and improves the accuracy of the local regressions, as formalized below.
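As a point of reference, the Gaussian kernel applied here (gwr.Gauss in the spgwr package, to our understanding) assigns each observation \(j\) a weight in the local regression at point \(i\) that decays smoothly with the distance \(d_{ij}\) relative to the bandwidth \(b\):

\[ w_{ij} = \exp\!\left(-\frac{1}{2}\,\frac{d_{ij}^{2}}{b^{2}}\right) \]

With an adaptive bandwidth, \(b\) varies across regression points so that a fixed share of observations receives meaningful weight (about 13 of 1,720 block groups here); with a fixed bandwidth, \(b\) is a constant distance (about 2,863.5 in the map’s distance units here).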

Overall, the GWR output reports a quasi-global R² of 0.632 for the adaptive-bandwidth model (0.624 for the fixed-bandwidth model), indicating a reasonably strong goodness of fit. This is broadly comparable to the OLS R² of 0.6623, but unlike a single global statistic, the GWR results reveal how the strength of the relationships varies across block groups.


AKAIKE INFORMATION CRITERION


shps <- as(shp, 'Spatial')

class(shps)
## [1] "SpatialPolygonsDataFrame"
## attr(,"package")
## [1] "sp"
bw_adaptive<-gwr.sel(formula=Log_MEDHVAL ~ Log_NBelPov100 + PCTBACHMOR + PCTSINGLES + PCTVACANT, 
            data=shps,
            method = "aic",
            adapt = TRUE)
## Bandwidth: 0.381966 AIC: 1278.726 
## Bandwidth: 0.618034 AIC: 1334.073 
## Bandwidth: 0.236068 AIC: 1209.53 
## Bandwidth: 0.145898 AIC: 1115.876 
## Bandwidth: 0.09016994 AIC: 1007.348 
## Bandwidth: 0.05572809 AIC: 910.4307 
## Bandwidth: 0.03444185 AIC: 821.2906 
## Bandwidth: 0.02128624 AIC: 737.6009 
## Bandwidth: 0.01315562 AIC: 681.6086 
## Bandwidth: 0.008130619 AIC: 660.8787 
## Bandwidth: 0.005024999 AIC: 714.2598 
## Bandwidth: 0.009856269 AIC: 667.0857 
## Bandwidth: 0.006944377 AIC: 667.59 
## Bandwidth: 0.008427569 AIC: 661.757 
## Bandwidth: 0.007677515 AIC: 663.6788 
## Bandwidth: 0.008171309 AIC: 660.931 
## Bandwidth: 0.00805268 AIC: 661.144 
## Bandwidth: 0.008130619 AIC: 660.8787
bw_adaptive
## [1] 0.008130619
bw_fixed<-gwr.sel(formula=Log_MEDHVAL ~ Log_NBelPov100 + PCTBACHMOR + PCTSINGLES + PCTVACANT, 
            data=shps,
            method = "aic",
            adapt = FALSE)
## Bandwidth: 47374.26 AIC: 1380.178 
## Bandwidth: 76576.63 AIC: 1412.408 
## Bandwidth: 29326.2 AIC: 1314.513 
## Bandwidth: 18171.89 AIC: 1205.472 
## Bandwidth: 11278.14 AIC: 1056.874 
## Bandwidth: 7017.572 AIC: 904.1873 
## Bandwidth: 4384.396 AIC: 773.8975 
## Bandwidth: 2757.003 AIC: 701.36 
## Bandwidth: 1751.22 AIC: 921 
## Bandwidth: 3378.612 AIC: 715.0242 
## Bandwidth: 2578.467 AIC: 707.8216 
## Bandwidth: 2916.634 AIC: 700.6484 
## Bandwidth: 2860.57 AIC: 700.4427 
## Bandwidth: 2865.223 AIC: 700.4423 
## Bandwidth: 2863.527 AIC: 700.4421 
## Bandwidth: 2863.506 AIC: 700.4421 
## Bandwidth: 2863.504 AIC: 700.4421 
## Bandwidth: 2863.504 AIC: 700.4421 
## Bandwidth: 2862.383 AIC: 700.4421 
## Bandwidth: 2863.076 AIC: 700.4421 
## Bandwidth: 2863.341 AIC: 700.4421 
## Bandwidth: 2863.442 AIC: 700.4421 
## Bandwidth: 2863.48 AIC: 700.4421 
## Bandwidth: 2863.495 AIC: 700.4421 
## Bandwidth: 2863.501 AIC: 700.4421 
## Bandwidth: 2863.503 AIC: 700.4421 
## Bandwidth: 2863.504 AIC: 700.4421 
## Bandwidth: 2863.503 AIC: 700.4421 
## Bandwidth: 2863.504 AIC: 700.4421 
## Bandwidth: 2863.504 AIC: 700.4421
bw_fixed
## [1] 2863.504


GAUSSIAN WEIGHTS


gwrmodel<-gwr(formula=Log_MEDHHINC~Log_MEDHVAL+PCTVACANT,
              data=shps,
              adapt = bw_adaptive, #adaptive bandwidth determined by proportion of observations accounted for
              gweight=gwr.Gauss,
              se.fit=TRUE, #to return local standard errors
              hatmatrix = TRUE)
gwrmodel
## Call:
## gwr(formula = Log_MEDHHINC ~ Log_MEDHVAL + PCTVACANT, data = shps, 
##     gweight = gwr.Gauss, adapt = bw_adaptive, hatmatrix = TRUE, 
##     se.fit = TRUE)
## Kernel function: gwr.Gauss 
## Adaptive quantile: 0.008130619 (about 13 of 1720 data points)
## Summary of GWR coefficient estimates at data points:
##                     Min.     1st Qu.      Median     3rd Qu.        Max.
## X.Intercept. -1.62492384  4.30769937  5.73537240  7.19240449 10.94460805
## Log_MEDHVAL  -0.07446270  0.28221101  0.41806650  0.55212360  1.08455590
## PCTVACANT    -0.08581501 -0.00973303 -0.00473283  0.00028396  0.02121667
##               Global
## X.Intercept.  5.6508
## Log_MEDHVAL   0.4316
## PCTVACANT    -0.0092
## Number of data points: 1720 
## Effective number of parameters (residual: 2traceS - traceS'S): 218.4288 
## Effective degrees of freedom (residual: 2traceS - traceS'S): 1501.571 
## Sigma (residual: 2traceS - traceS'S): 0.3187979 
## Effective number of parameters (model: traceS): 153.2121 
## Effective degrees of freedom (model: traceS): 1566.788 
## Sigma (model: traceS): 0.3120925 
## Sigma (ML): 0.2978683 
## AICc (GWR p. 61, eq 2.33; p. 96, eq. 4.21): 1053.968 
## AIC (GWR p. 96, eq. 4.22): 868.1632 
## Residual sum of squares: 152.6079 
## Quasi-global R2: 0.6323028
gwrmodel_fixed<-gwr(formula=Log_MEDHHINC~Log_MEDHVAL+PCTVACANT,
              data=shps,
              bandwidth = bw_fixed,
              gweight=gwr.Gauss,
              se.fit=TRUE,
              hatmatrix = TRUE)

gwrmodel_fixed
## Call:
## gwr(formula = Log_MEDHHINC ~ Log_MEDHVAL + PCTVACANT, data = shps, 
##     bandwidth = bw_fixed, gweight = gwr.Gauss, hatmatrix = TRUE, 
##     se.fit = TRUE)
## Kernel function: gwr.Gauss 
## Fixed bandwidth: 2863.504 
## Summary of GWR coefficient estimates at data points:
##                      Min.      1st Qu.       Median      3rd Qu.         Max.
## X.Intercept. -11.60425222   4.20981344   5.72170937   7.15230843  12.57363749
## Log_MEDHVAL   -0.16280443   0.28512824   0.41809354   0.56436662   1.92344072
## PCTVACANT     -0.15049402  -0.00885470  -0.00473414  -0.00051821   0.06389959
##               Global
## X.Intercept.  5.6508
## Log_MEDHVAL   0.4316
## PCTVACANT    -0.0092
## Number of data points: 1720 
## Effective number of parameters (residual: 2traceS - traceS'S): 219.3262 
## Effective degrees of freedom (residual: 2traceS - traceS'S): 1500.674 
## Sigma (residual: 2traceS - traceS'S): 0.322455 
## Effective number of parameters (model: traceS): 157.5513 
## Effective degrees of freedom (model: traceS): 1562.449 
## Sigma (model: traceS): 0.3160162 
## Sigma (ML): 0.3011951 
## AICc (GWR p. 61, eq 2.33; p. 96, eq. 4.21): 1102.685 
## AIC (GWR p. 96, eq. 4.22): 910.7106 
## Residual sum of squares: 156.0359 
## Quasi-global R2: 0.6240434


MORAN’S I HISTOGRAM & SCATTERPLOT / GWR RESIDUALS


The Moran’s I scatterplot of the GWR residuals suggests lower spatial autocorrelation than the OLS, spatial lag, or spatial error models. The global Moran’s I for the GWR residuals is 0.03, indicating very weak positive clustering, and the permutation test’s p-value of 0.014 provides confidence that even this slight clustering is not solely due to chance. In contrast to the OLS residuals, the GWR residuals exhibit far less clustering, and their Moran’s I statistic has the smallest magnitude among the models, indicating the least spatial autocorrelation. The spatial lag and spatial error models, by comparison, show mildly negative Moran’s I values, indicating slight spatial dispersion of their residuals.
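The Moran’s I and pseudo p-value cited above could be reproduced along the following lines (a sketch assuming gwrmodel and queenlist from earlier steps; the gwr.e column of the model’s SDF slot holds the local GWR residuals):

# Local GWR residuals
gwr_res <- gwrmodel$SDF$gwr.e

moran.mc(gwr_res, queenlist, 999, alternative = "two.sided")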



LOCAL GWR OUTPUTS


summary(gwrmodel$SDF)
## Object of class SpatialPolygonsDataFrame
## Coordinates:
##         min       max
## x 2660604.8 2750171.3
## y  207610.6  304858.8
## Is projected: NA 
## proj4string : [NA]
## Data attributes:
##      sum.w        X.Intercept.     Log_MEDHVAL         PCTVACANT        
##  Min.   :16.03   Min.   :-1.625   Min.   :-0.07446   Min.   :-0.085815  
##  1st Qu.:24.47   1st Qu.: 4.308   1st Qu.: 0.28221   1st Qu.:-0.009733  
##  Median :26.64   Median : 5.735   Median : 0.41807   Median :-0.004733  
##  Mean   :27.48   Mean   : 5.629   Mean   : 0.42757   Mean   :-0.005790  
##  3rd Qu.:29.45   3rd Qu.: 7.192   3rd Qu.: 0.55212   3rd Qu.: 0.000284  
##  Max.   :86.70   Max.   :10.945   Max.   : 1.08456   Max.   : 0.021217  
##  X.Intercept._se  Log_MEDHVAL_se     PCTVACANT_se          gwr.e          
##  Min.   :0.4056   Min.   :0.03587   Min.   :0.002266   Min.   :-2.135873  
##  1st Qu.:0.9944   1st Qu.:0.08859   1st Qu.:0.004812   1st Qu.:-0.138334  
##  Median :1.4246   Median :0.12996   Median :0.006458   Median : 0.024808  
##  Mean   :1.5779   Mean   :0.14253   Mean   :0.007720   Mean   :-0.001443  
##  3rd Qu.:2.0049   3rd Qu.:0.18220   3rd Qu.:0.009196   3rd Qu.: 0.175134  
##  Max.   :6.2484   Max.   :0.54488   Max.   :0.030986   Max.   : 1.216632  
##       pred           pred.se           localR2        X.Intercept._se_EDF
##  Min.   : 9.204   Min.   :0.02809   Min.   :0.05656   Min.   :0.4143     
##  1st Qu.: 9.981   1st Qu.:0.05094   1st Qu.:0.24148   1st Qu.:1.0158     
##  Median :10.231   Median :0.06077   Median :0.34087   Median :1.4552     
##  Mean   :10.245   Mean   :0.06645   Mean   :0.35582   Mean   :1.6118     
##  3rd Qu.:10.521   3rd Qu.:0.07619   3rd Qu.:0.45532   3rd Qu.:2.0480     
##  Max.   :12.171   Max.   :0.25205   Max.   :0.75141   Max.   :6.3826     
##  Log_MEDHVAL_se_EDF PCTVACANT_se_EDF     pred.se.1      
##  Min.   :0.03664    Min.   :0.002315   Min.   :0.02870  
##  1st Qu.:0.09050    1st Qu.:0.004916   1st Qu.:0.05203  
##  Median :0.13275    Median :0.006597   Median :0.06208  
##  Mean   :0.14559    Mean   :0.007886   Mean   :0.06788  
##  3rd Qu.:0.18612    3rd Qu.:0.009394   3rd Qu.:0.07782  
##  Max.   :0.55659    Max.   :0.031652   Max.   :0.25747


LOCAL GWR RESIDUAL OUTPUTS


The presented maps illustrate standardized coefficients, derived by dividing each local beta coefficient by its standard error. The greater the absolute value of this ratio, the more likely the relationship between the predictor and the dependent variable is significant in that area of Philadelphia. A standardized coefficient exceeding two suggests a potentially significant local positive relationship between the predictor and the natural log of median home sales value, while a coefficient below negative two indicates a potentially significant negative relationship. Further analysis of the maps shows a possible positive correlation between the percentage of residents with bachelor’s degrees and the natural log of median home sale value across Philadelphia.

There is also a potential statistically significant negative association between the percentage of vacant houses and the natural log of median home sale value in various parts of the city - including Northwest Philadelphia, South Philadelphia, and areas of West Philadelphia. Further, large sections of Northeast and Northwest Philadelphia display a potentially significant positive relationship between the percentage of single-family houses and the natural log of median home values, a trend less prevalent in parts of the city with fewer single-family homes. In summary, the visualized standardized coefficients provide valuable insights into potential relationships between predictors and dependent variables across Philadelphia, suggesting local variations in correlation strengths and depicting spatial patterns.


LOCAL GWR R-SQUARED MAP


The map of local R² values from the GWR regression shows how the effectiveness of the localized model in explaining variation in median home sale values differs across the city. The R² values are unevenly dispersed, with prominent clusters of high values in Northwest Philadelphia and low values in the northern areas adjacent to Center City. Local R² values above 0.5 prevail in South Philadelphia, Northwest Philadelphia, and Northeast Philadelphia, indicating that the model explains the variability of the dependent variable well in those locales. Conversely, specific zones in North Philadelphia and West Philadelphia show local R² values below 0.2, underscoring the model’s limitations in explaining the variation in median home sale values within those particular locations.

gwrresults<-as.data.frame(gwrmodel$SDF)
shp$coefLog_MEDHVALst<-gwrresults$Log_MEDHVAL/gwrresults$Log_MEDHVAL_se
shp$coefPCTVACANTst<-gwrresults$PCTVACANT/gwrresults$PCTVACANT_se

shp$gwrE<-gwrresults$gwr.e
shp$localR2<-gwrresults$localR2

coefLog_MEDHVAL<-tm_shape(shp)+
  tm_fill(col='coefLog_MEDHVALst', 
          breaks=c(-Inf, -6, -4, -2, 0, 2, 4, 6, Inf), 
          title='Standardized coefficient of LNMEDHVAL', 
          palette ='-RdBu')+
  tm_layout(frame=FALSE, 
            title = 'Median House Value (Log)')

coefPCTVACANT<-tm_shape(shp)+
  tm_fill(col='coefPCTVACANTst', 
          breaks=c(-Inf, -6, -4, -2, 0, 2, 4, 6, Inf), 
          title='Standardized coefficient of PCTVACANT', 
          palette='-RdBu')+
  tm_layout(frame=FALSE, 
            title = 'Percentage of Housing Vacant')

# Display The Two Standardized-Coefficient Maps Side By Side
tmap_arrange(coefLog_MEDHVAL, coefPCTVACANT)

tm_shape(shp)+
  tm_fill(col='localR2',  
          breaks=c(0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7), 
          palette = "Oranges")+
  tm_layout(frame=FALSE)


DISCUSSION


Spatial regression techniques provide a more sophisticated and robust framework for modeling spatial data than traditional Ordinary Least Squares (OLS) regression. This study employed a suite of such techniques, including spatial lag regression, spatial error regression, and geographically weighted regression (GWR), to analyze the spatial distribution of median home values in Philadelphia. Among the models, GWR produced the residual Moran’s I closest to zero, indicating it was the most effective at capturing the spatial autocorrelation patterns in the data.

Despite its overall effectiveness, GWR’s performance was not uniform across all neighborhoods. While certain predictors consistently explained variations in median home values, others were less effective, particularly in areas with a large minority population, such as North and West Philadelphia. This suggests that the underlying factors influencing home values may vary across different spatial contexts. Additionally, GWR’s local regressions were susceptible to multicollinearity, where highly correlated predictors, such as the proportion of residents with at least a bachelor’s degree and the proportion of single-family homes, could mask the true relationship between home values and their associated factors.

A terminological note: spatially lagged residuals, the weighted average of each observation’s neighboring residuals (as used in the Moran scatterplots above), should not be confused with the residuals of the spatial lag model. The former reflect the spatial structure of the data and support a more nuanced interpretation of spatial autocorrelation patterns; maintaining precision in the use of these terms avoids ambiguity in spatial data analysis.
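As a minimal illustration of the distinction (assuming reg and queenlist from earlier steps), spdep’s lag.listw computes exactly this weighted average of neighboring values:

# Spatially lagged OLS residuals: weighted average of each observation's neighbors' residuals
lagged_res <- lag.listw(queenlist, residuals(reg))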

In conclusion, spatial regression techniques, particularly GWR, offer a more nuanced and robust approach to modeling spatial data than OLS regression. It is nevertheless crucial to acknowledge GWR’s limitations, such as its susceptibility to multicollinearity and its reduced performance in specific areas. Finally, a note on tooling: although ArcGIS Pro is widely used GIS software, we do not recommend it for GWR because its methodology is poorly documented and its results were inconsistent in our experience. R proved a more robust and reliable platform for the GWR modeling demonstrated in this study of median house values across block groups in Philadelphia.