Authors: Nohman Akthtari, Alec Jacobs, Trevor Kapuvari | 12.03.2023
Course: Statistical And Data Mining Methods For Urban Spatial Analytics
In our previous study, we employed Ordinary Least Squares (OLS) regression to examine the relationship between median home sale value by census block group for homes in Philadelphia and four predictors: PCTBACHMOR (proportion of residents in a block group with at least a bachelor’s degree), PCTVACANT (proportion of housing units that are vacant), PCTSINGLE (percent of housing units that are detached single family houses), and LNNBELPOV (natural log of the number of households living in poverty). While OLS regression is a common and powerful tool, it assumes that the model errors are independent of one another, including across space. However, in many real-world scenarios, data exhibit spatial autocorrelation, meaning that nearby observations are more likely to be similar than observations that are farther apart. This can lead to biased and inefficient OLS estimates.
To address this issue, we employ three alternative modeling approaches in this study: spatial lag, spatial error, and geographically weighted regression (GWR). Spatial lag models account for spatial autocorrelation by incorporating the values of the dependent variable from neighboring observations into the regression equation. Spatial error models, on the other hand, assume that the residuals, rather than the dependent variable, are spatially autocorrelated. GWR, unlike OLS and spatial lag and error models, allows the regression coefficients to vary across space, reflecting the potential heterogeneity in the relationships between the predictors and the dependent variable.
Our primary objective is to assess whether these alternative models can offer improved capabilities in accounting for spatial autocorrelation compared to OLS. We will compare the performance of the models based on their goodness-of-fit statistics, spatial autocorrelation diagnostics, and the interpretability of their coefficients. Additionally, we will examine how the spatial models compare to each other in terms of their ability to capture the spatial variation in home sales values.
The guiding principle of spatial statistics is well summarized by the following quote:
\[ \textit{"Everything is related to everything else, but near things are more related than distant things.”} \] This quote comes from the geographer Waldo Tobler and is known as the First Law of Geography. The statistical translation of Tobler’s law is spatial autocorrelation, which is the spatial analogue of temporal autocorrelation. Temporal autocorrelation means that values of a variable tend to be similar for time periods that are close together. We can translate the same idea into space. Given an areal division of a two-dimensional space into polygons, we measure whether neighboring (or “close”) locations have similar values with regard to our variable of interest. If we observe such a trend, we speak of positive spatial autocorrelation. If not, we refer to the absence of spatial autocorrelation, or to negative spatial autocorrelation, indicating that neighboring (or “close”) values exhibit repulsion by being significantly different.
There are several measures of spatial autocorrelation, but the most widely used is Moran’s I. We first present the formula and its components, and then offer an explanation and interpretation.
Let there be an areal polygon division of the observed two-dimensional space, and let \(X\) be the variable of interest. Moran’s I is then calculated as:
\[ I = \frac{1}{\sum_{i = 1}^{n} \sum_{j = 1}^{n}w_{ij}} \frac{\sum_{i=1}^{n}\sum_{j=1}^{n}w_{ij}(X_i - \overline{X})(X_j - \overline{X})}{\sum_{i = 1}^{n}(X_i - \overline{X})^2 / n}, \] where \(X_i, X_j\) are the observed values of the variable in polygons \(i,j\) respectively, \(\overline{X}\) is the mean of the variable, \(w_{ij}\) is the weight indexing of location \(i\) relative to location \(j\), and \(n\) is the number of observed areal units.
\(w_{ij}\) is the \(ij\)-th entry in the contiguity weight matrix \(W \in M_{n \times n}(\mathbb{R}),\) where \(M_{n \times n}(\mathbb{R})\) is the vector space of \(n \times n\) matrices with entries in \(\mathbb{R}.\) The contiguity weight matrix codes spatial proximity into spatial weights. Each row contains the weights relating one areal unit to each of the \(n\) areal units in question. It is best practice to row-standardize \(W\): the weights then sum to \(n,\) so the first fraction reduces to \(1/n\) and cancels against the \(1/n\) in the denominator of the equation above.
Whether locations \(i,j\) are neighbors depends on whether \(w_{ij}\) is non-zero, which in turn depends on the chosen type of \(W.\) While there are distance-based approaches for \(W,\) the standard literature usually prefers the so-called Queen contiguity matrix. The name derives from the allowed moves of the queen on a chessboard and translates visually into the notion that each location \(j\) that shares a border or vertex with location \(i\) is defined as a neighbor, implying \(w_{ij} \not= 0.\) Hence, non-neighboring locations are effectively cut out of the presented formula. For the rest of our study, we will employ the Queen contiguity method.
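As a minimal sketch (assuming the block-group polygons are already loaded as the sf object shp used later in this report), the Queen contiguity structure and its row-standardized weights can be built with the spdep package; the object names queen_nb and queen_w are ours and purely illustrative:
library(spdep)
# Queen contiguity: polygons sharing a border or a vertex count as neighbors
queen_nb <- poly2nb(shp, queen = TRUE)
# Row-standardized spatial weights (style = "W"): each row of W sums to 1
queen_w <- nb2listw(queen_nb, style = "W", zero.policy = TRUE)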
It becomes evident that statistical work employing Moran’s I can be sensitive to the choice of \(W,\) potentially to the point of invalidating results. Therefore, many statisticians demonstrate the robustness of their results by showing that they also hold for other types of \(W.\) With the above explanations and the definition of the Pearson correlation coefficient in our previous assignment, we see that Moran’s I is a standardized number between \(-1\) and \(1\) measuring how much neighboring values of \(X\) co-vary relative to the overall variation in \(X\), the term in the denominator of the second fraction in the presented formula. A value close to 1 indicates strong positive spatial autocorrelation, denoting clustering of similar values in nearby locations. A value close to -1 indicates negative spatial autocorrelation, implying repulsion of similar values in nearby locations.
A way of significance testing for Moran’s I is to randomly permute the observed values of our variable \(X\) across locations, thereby assigning each location \(i\) a potentially new value.
We first introduce the algorithm to obtain values for our Moran’s I significance test. Then, we introduce pseudo p-values and the significance test, and end by giving an interpretation of the employed method.
We denote our original Moran’s I value as \(\text{Moran}_{\text{initial}}\) and follow with a random permutation, calculating Moran’s I for our first permutation. We then repeat this process another 998 times and rank the 999 permutation results together with \(\text{Moran}_{\text{initial}}\) in descending order. Now, let \(\text{rank}(x_i), i \in \{1, \ldots, 1000\},\) return the relative position of \(x_i\) in the ordered set. Since there exists an \(i'\) such that \(x_{i'} = \text{Moran}_{\text{initial}},\) we can then calculate the so-called pseudo p-value by computing
\[ p = \frac{\text{rank}(\text{Moran}_{\text{initial}})}{1000}. \]
For our hypothesis test, we have the following statements:
\[ H_0: \text{There is no spatial autocorrelation.} \]
\[ H_{a_1}: \text{There is positive spatial autocorrelation.} \]
\[ H_{a_2}: \text{There is negative spatial autocorrelation.} \]
If the pseudo p-value is less than 0.05, we can reject the null hypothesis of no spatial autocorrelation in favor of either \(H_{a_1}\) or \(H_{a_2}.\) If \(\text{Moran}_{\text{initial}}\) is positive, the evidence favors \(H_{a_1}\) over \(H_{a_2},\) and vice versa.
In simple terms, given a set of values of \(X\) for each areal unit of a polygon division of a two-dimensional space, random permutation recalculates Moran’s I with the same contiguity weight matrix \(W\) for each permutation. This produces meaningful values for our significance test to work with since it shows the relative extremeness of \(\text{Moran}_{\text{initial}}.\) Our hypothesis test with the three presented outcomes is then evaluated via the pseudo p-value with the standard threshold of 0.05, where the pseudo p-value is the probability of observing a value that is equal to or more extreme than \(\text{Moran}_\text{initial}.\)
If the pseudo p-value exceeds the chosen significance level, we fail to reject \(H_0\), implying no evidence of spatial autocorrelation. Conversely, if the pseudo p-value associated with our Moran’s I test is lower than the predetermined significance level (typically set at 0.05), we reject the null hypothesis \(H_0\) in favor of the alternative. This rejection indicates statistically significant clustering (or, for negative Moran’s I, dispersion) in space.
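The permutation procedure described above corresponds to a Monte Carlo Moran test. A minimal sketch with spdep, assuming the row-standardized Queen weights queen_w from the earlier sketch and the logged dependent variable created in the data preparation section below:
library(spdep)
set.seed(1234)  # make the random permutations reproducible
# 999 permutations plus the observed arrangement -> pseudo p-value = rank / 1000
moran_perm <- moran.mc(shp$Log_MEDHVAL, listw = queen_w, nsim = 999, zero.policy = TRUE)
moran_perm$statistic  # observed Moran's I
moran_perm$p.value    # pseudo p-value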
For Moran’s I, in all of the above constructions, we assumed a given division of two-dimensional space. However, the value of Moran’s I depends on the choice of division; Moran’s I is therefore not canonical. In geographic terms, the modifiable areal unit problem (MAUP) is inherent in the measure, and one well-known manifestation of it is gerrymandering. As mentioned, the choice of \(W\) can also change the results of spatial analyses. Edge areal units by definition don’t have a “full” neighborhood, a constraint that can significantly affect the results, especially if the weights assigned to the neighbors of border areal units are high in \(W.\) In addition, areal units can have widely differing numbers of neighbors, again qualifying the generalizability of the measure.
Further, patterns such as a total division between high and low values across the two halves of a map can result in nearly no measured spatial autocorrelation, even though the pattern is in practice highly structured; only the random-shuffling significance test would show us that something is off with Moran’s I in these cases. Additionally, some choices of variable \(X\) suffer from non-uniformity in space; for example, bank robberies can only occur at banks. Ultimately, Moran’s I is a global measure that does not reveal the source of a high or low value, because it does not tell us which areas of our space contribute to its magnitude. For this reason, it often makes sense to visualize local indicators of spatial association (LISA).
Conceptually, LISA is similar to Moran’s I, but instead of measuring global spatial autocorrelation, it measures the extent to which the value of \(X\) at site \(i\) is related to the values of \(X\) at neighboring sites. The formula for LISA is given by
\[ I_i =\frac{1}{\sum_{j = 1}^{n}w_{ij}} \frac{(X_i - \overline{X}) \sum_{j=1}^{n}w_{ij}(X_j - \overline{X})}{\sum_{k=1}^{n}(X_k - \overline{X})^2 / n}, \] and we see that the LISA statistic is proportional to Moran’s I in the sense that, with row-standardized weights, the global Moran’s I is the mean of the \(I_i\) taken over all locations \(i\) of our spatial division.
The procedure for the LISA significance test is the same as for the global Moran’s I, but it is conducted for a single location \(i\) instead of the whole data set. It is noteworthy that when we reshuffle for a particular location \(i,\) we do so for the values at its \(j\) neighbors but not for \(i\) itself. Our hypotheses are:
\[ H_{i_0}: \text{There is no spatial autocorrelation at location } i. \]
\[ H_{i_{a_1}}: \text{There is positive spatial autocorrelation at location } i. \]
\[ H_{i_{a_2}}: \text{There is negative spatial autocorrelation at location } i. \]
The interpretations mirror those given for the global Moran’s I value, but are constrained to location \(i.\)
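A minimal sketch of the local Moran’s I computation in spdep, again assuming the row-standardized Queen weights queen_w; column names of the returned matrix may vary slightly across spdep versions, and a conditional permutation variant is available as localmoran_perm() in recent releases:
library(spdep)
# Local Moran's I (LISA) for each block group; analytical p-values by default
lisa <- localmoran(shp$Log_MEDHVAL, listw = queen_w, zero.policy = TRUE)
head(lisa)                  # columns include the local statistic Ii and a p-value
shp$LISA_I <- lisa[, "Ii"]  # attach the local statistic for mapping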
Multiple regression analysis is a statistical technique that uses \(k\) independent variables to identify the surface in \((k+1)\)-dimensional space that minimizes the sum of squared residuals for our dependent variable observations. A residual is the difference, measured along the dependent variable axis, between the best-fit surface and the observed data point. Given that all assumptions for a multiple linear regression are fulfilled, the core idea is that, based on our historic data on the \(k\) independent variables, the best-fit surface will be a reliable and robust predictor of future values of the dependent variable. In standard statistical terminology, we say that we have a good regression model if our \(k\) independent variables ‘explain’ much of the variation, that is, if our best-fit surface is close to the observed value of the dependent variable for each observation.
In order to use Multiple Regression Analysis, we have to ensure that all assumptions are met:
Linearity
Normality of residuals
Homoscedasticity of residuals
No multicollinearity
Continuity of dependent variable
Residuals are random
Assumption of independence
For a detailed explanation of OLS regression and its underlying assumptions, please refer to our previous assignment. In the results section of that paper, we observed a common occurrence wherein data with spatial components exhibited clustering in the residuals. Notably, we consistently underestimated high logged median house values while overestimating low values. Visualizing the residuals of the OLS regression often helps identify data with such spatial components. Below is a map of standardized residuals illustrating the clustering phenomenon:
The presence of clustered standardized errors signals the existence of an unaccounted spatial process. Simply put, the seemingly random noise conceals pertinent information crucial for evaluating the efficacy of our Ordinary Least Squares (OLS) regression. Consequently, the assumptions of homoscedasticity and independence of residuals are compromised, thereby violating key tenets of our regression analysis. Furthermore, the significance values associated with the \(\beta\) coefficients may be artificially inflated. An alternative method for detecting spatial autocorrelation involves regressing the residuals of the OLS model against the lagged neighboring residuals. This technique, clarified in the subsequent subsection, yields \(\beta\) coefficients termed the “b slope” for the lagged residuals.
Several methods exist to address a spatial process in our dependent variable. We first introduce spatial lag and spatial error models, followed by an exploration of geographically weighted regression. All spatial lag and spatial error models will be estimated in R.
A spatially lagged variable is the average of that variable’s values at the areal units identified as neighbors by the contiguity weight matrix \(W.\) Applying this to the previous section, if we were to see a significant relationship between our residuals and the lagged residuals, the assumption of independence of residuals would be violated. This also implies that the \(\beta\) coefficients and their significance values may be inaccurate, as they are inflated, which was discussed in the previous section. The goal of both spatial lag and spatial error regression is to remove these spatial dependencies in the residuals and to reduce heteroscedasticity.
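As a minimal sketch, a spatially lagged variable is simply the weighted average of neighboring values under the row-standardized Queen weights queen_w, which spdep computes directly; regressing the OLS residuals on their spatial lag then yields the “b slope” mentioned above. The OLS specification shown here follows the four predictors named in the introduction and uses the logged variables created in the data preparation section below; it is illustrative rather than the exact model from our previous assignment:
library(spdep)
# Spatial lag of a variable: average of neighbors' values under row-standardized weights
shp$lag_LNMEDHVAL <- lag.listw(queen_w, shp$Log_MEDHVAL)
# Illustrative OLS model with the four predictors from the introduction
ols_model <- lm(Log_MEDHVAL ~ PCTBACHMOR + PCTVACANT + PCTSINGLES + Log_NBelPov100, data = shp)
# "b slope": regress the OLS residuals on their spatially lagged residuals
resid_ols <- resid(ols_model)
lag_resid_ols <- lag.listw(queen_w, resid_ols)
summary(lm(resid_ols ~ lag_resid_ols))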
Both types of regression retain all OLS regression assumptions except the independence of residuals, and they posit that the dependent variable in one location is linked to the values of that variable in nearby locations. The violated independence-of-residuals assumption is precisely the factor we seek to address, and both models look to remedy it by incorporating an additional predictor.
We have presented two ways of identifying spatial dependencies: one visual, and one that regresses residuals on lagged residuals. The reader might now ask whether there are fixed cut-off values beyond which one can be sure to use spatial lag or spatial error models. While no such value exists, we generally recommend, when in doubt, running both OLS regression and the spatial lag or spatial error regression and comparing the results on a number of criteria. We will discuss this in more detail in the sections “Spatial Lag vs OLS Regression” and “Spatial Error vs OLS Regression.”
Spatial lag ideally corrects the OLS model by including the lagged dependent variable as an “independent variable” in the regression:
\[ y = \rho Wy + \beta_0 + \beta_1X_1 + \ldots + \beta_nX_n + \epsilon \] where \(\rho\) is the spatial lag parameter of the \(y\)-lag variable \(Wy.\) As such, \(\rho\) can take values between -1 and 1. Since the lag variable \(Wy\) is itself constructed from \(y,\) spatial lag can be seen as a two-step regression model. Note: the new summand in the formula is a new predictor that ideally captures the spatial dependencies that were previously contained in the error term \(\epsilon;\) this, however, is not always the case. \(\rho\) can intuitively be interpreted as the coefficient of the \(y\)-lag variable \(Wy.\) The interpretation of the \(\beta\) coefficients in the spatial lag model is complicated and not equivalent to that in spatial error and OLS regression. Therefore, we will only interpret the spatial lag term according to whether its value is positive or negative: positive values indicate positive spatial autocorrelation, and vice versa.
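A minimal sketch of fitting the spatial lag model in R, assuming the spatialreg package, the row-standardized Queen weights queen_w, and the same illustrative specification as the OLS sketch above:
library(spatialreg)
# Spatial lag model: y = rho*W*y + X*beta + epsilon, estimated by maximum likelihood
lag_model <- lagsarlm(Log_MEDHVAL ~ PCTBACHMOR + PCTVACANT + PCTSINGLES + Log_NBelPov100,
                      data = shp, listw = queen_w, zero.policy = TRUE)
summary(lag_model)  # reports rho, its significance, the AIC, and the log-likelihood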
If our added corrective variable \(Wy\) is significant with \(p < 0.05,\) we can say that our dependent variable exhibits a spatial process, as it demonstrates a significant correlation with adjacent values of the dependent variable. It is essential to note that the inclusion of this corrective variable not only tends to alter parameters such as the intercept or the coefficients but also affects the significance levels associated with each. In examining heteroscedasticity, which is linked to the independence of errors, we employ three testing methods: the White Test, the Breusch-Pagan Test, and the Koenker-Bassett Test. These tests help assess the presence of heteroscedasticity in our model, providing valuable insights into the variance structure of the residuals.
White Test: This test serves as both a heteroscedasticity-consistent covariance matrix estimator and an evaluator of heteroscedasticity, without necessitating a predefined form for the alternative hypothesis regarding changing variances. In contrast to the Breusch-Pagan test, the White Test exhibits a broader applicability, making it a versatile tool for detecting and addressing heteroscedasticity in regression models. Its flexibility allows for a more comprehensive assessment of variance structures, contributing to the robustness of our statistical analysis.
Breusch-Pagan Test: For the assessment of both heteroscedasticity and random coefficient variation, this test derives its test statistic by computing the squared residuals from the OLS regression model. In this process, the test scrutinizes the variance of the residuals by contrasting it with the variance predicted by a supplementary regression model. This auxiliary model utilizes the same independent variables employed in the original model to forecast the squared residuals. By comparing these variances, the test provides valuable insights into the presence and nature of heteroscedasticity and random coefficient variation within the regression framework, enhancing our understanding of the model’s performance.
Koenker-Bassett Test: A modified version of the Breusch-Pagan test, this variant considers the potential impact of individual data points on the assessment of heteroscedasticity. Notably, a high value for this statistic not only signals the presence of heteroscedasticity but also suggests a correlation between the variance of the residuals and the magnitude of the predicted values. This nuanced perspective enhances the diagnostic capability of the test, providing a more comprehensive understanding of the interplay between individual data points and the variability in the model’s predictions.
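As a minimal sketch, the Breusch-Pagan test and its studentized variant (which corresponds to the Koenker-Bassett adjustment) can be run on the illustrative OLS model with the lmtest package:
library(lmtest)
# Classical Breusch-Pagan test on the OLS model
bptest(ols_model, studentize = FALSE)
# Koenker's studentized (robust) variant
bptest(ols_model, studentize = TRUE)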
The Akaike Information Criterion (AIC) and the Schwarz Criterion (SC, also known as the BIC) are measures of the goodness of fit of an estimated statistical model. These measures act as relative indicators, quantifying the information lost when employing a particular model to represent reality. Essentially, they capture the balance between the precision and complexity of the model, providing valuable insights into the model’s efficacy in capturing essential patterns while avoiding overfitting. The AIC is given by:
\[ \text{AIC} = -2\log (p(y \mid \omega_{ML}))+2K \] where \(y\) is the observed data, \(K\) is the number of estimated parameters, \(\omega_{ML}\) is the parameter value that maximizes the likelihood function for \(y,\) and \(p(y \mid \omega_{ML})\) is therefore the maximized value of the likelihood function for the model. When comparing models, lower AIC values indicate a better goodness of fit, reflecting a greater maximized likelihood relative to model complexity. The SC is related to the AIC, but the AIC is usually preferred for predictive models whereas the SC favors more parsimonious models. This is because the AIC favors the model that most adequately describes an unknown data-generating process, while the SC tries to find the true model within a set of candidates. Because neither criterion requires the models to be nested, AIC and SC allow us to compare spatial lag with spatial error.
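A minimal sketch of the comparison, assuming the illustrative ols_model and lag_model objects from the earlier sketches (a fitted spatial error model can be added to the call once it exists):
# Lower values indicate a better fit after penalizing model complexity
AIC(ols_model, lag_model)
# BIC(ols_model, lag_model) would report the Schwarz Criterion analogue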
The log-likelihood method uses maximum likelihood to determine the parameter values of a statistical model that best fit the observed data, where a higher log-likelihood indicates a better fit. We are only able to use the log-likelihood to compare nested models; the OLS regression model is a nested version of the spatial lag and spatial error regression models in which the spatial lag or spatial error terms are omitted. We can compare our OLS model to each spatial regression separately, but we are unable to compare the spatial regressions to each other using the log-likelihood criterion.
The LRT assesses the goodness of fit between two nested models:
\[ H_0: \text{Log-Likelihood Standard OLS Reg. - Log-Likelihood Spatial Lag Reg. = 0.} \\ H_A: \text{Log-Likelihood Standard OLS Reg. - Log-Likelihood Spatial Lag Reg.} \not= 0. \] Two models are nested if they have the same dependent variable and the independent variables of one model form a subset of those of the other. From the formula for spatial lag regression, we see that OLS regression is nested within spatial lag regression, corresponding to \(\rho = 0.\) The likelihood function of a model is the probability of actually observing the values of the dependent variable given our \(\beta\)-coefficients, and the log-likelihood function is its logarithm. With this in mind, the hypothesis test makes sense: if we reject \(H_0,\) this implies that one set of coefficients is significantly better than the other at predicting the observed values of the dependent variable.
Since adding the spatial process usually significantly increases the accuracy of a model (given that there is spatial autocorrelation), we typically reject \(H_0\) for \(p < 0.05\) and state that the spatial lag model is more predictive than the standard OLS model. The LRT can be seen as an evaluation of whether our additional variable actually improved the model in terms of accuracy. With regard to the spatial lag regression, the LRT is hence a test of whether our new “predictor” \(\rho Wy\) absorbed some of the effect of the spatial process contained in the error term of the ordinary OLS regression. Note: the LRT penalizes models that are more complex (e.g., have more independent variables) but yield the same goodness of fit as a nested model with fewer independent variables. Furthermore, we cannot compare spatial lag and spatial error regression with the LRT since these models are not nested within each other.
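A minimal sketch of the likelihood ratio test against OLS, computed by hand from the illustrative model objects above; the summary output of the spatial lag model typically reports the same test:
# LR statistic: twice the difference in log-likelihoods,
# chi-squared with 1 degree of freedom (the single extra parameter is rho)
lr_stat <- 2 * (as.numeric(logLik(lag_model)) - as.numeric(logLik(ols_model)))
p_value <- pchisq(lr_stat, df = 1, lower.tail = FALSE)
c(LR = lr_stat, p = p_value)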
Spatial error is a two-step regression model that ideally corrects the OLS model by splitting the error term \(\epsilon\) into random noise and the spatially lagged residual:
\[ y = \beta_0 + \beta_1X_1 + \ldots + \beta_nX_n + \lambda W \epsilon + \mu \] where \(\mu\) is the random noise, \(\lambda W \epsilon\) is the spatially lagged residual with the lag error parameter \(\lambda \in [-1, 1],\) and \(\epsilon = \lambda W \epsilon + \mu.\) The spatially lagged residual is obtained by regressing the residuals on the residuals of their neighbors as defined by \(W.\) This is the second step in the two-step regression process and, in the best case, extracts the spatial dependencies into the spatially lagged residual \(\lambda W \epsilon.\) Note that \(\lambda W \epsilon\) is now a predictor in our model. \(\lambda\) can intuitively be interpreted as the coefficient of the spatially lagged residual \(W \epsilon.\)
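A minimal sketch of the spatial error model in R, again assuming spatialreg, the Queen weights queen_w, and the same illustrative specification:
library(spatialreg)
# Spatial error model: y = X*beta + epsilon, with epsilon = lambda*W*epsilon + mu
error_model <- errorsarlm(Log_MEDHVAL ~ PCTBACHMOR + PCTVACANT + PCTSINGLES + Log_NBelPov100,
                          data = shp, listw = queen_w, zero.policy = TRUE)
summary(error_model)  # reports lambda, its significance, the AIC, and the log-likelihood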
Our respective beta coefficients in the above formula are the solutions to the optimization problem in the multiple regression equation presented in our first analysis. There is exactly one beta coefficient for each independent variable, and the sign of the beta coefficient indicates whether the relationship between the dependent and independent variable is positive or negative in the context of our regression model. Intuitively, a beta coefficient tells us by how much the value of the dependent variable changes when predictor \(i\) goes up by one unit and all other predictors are held constant. As such, the beta coefficient can be thought of as a measure of the strength of the relationship between the dependent and independent variable in the context of our regression model. The \(\beta\) coefficients are statistically significant if their p-values fall below the chosen threshold. Intuitively, the above also holds for the coefficient \(\lambda.\)
Everything that we explained for Spatial Lag vs OLS Regression also holds for spatial error regression. Again, OLS regression is nested within spatial error regression with \(\lambda = 0.\)
The choice of whether or not to use a spatial regression model depends on the outcome of the Lagrange Multiplier (LM) test. If neither the LM test for spatial lag nor the LM test for spatial error is significant, then OLS regression is sufficient, and no spatial regression model is needed. However, if one or both of the LM tests are significant, then a spatial regression model should be used.
If only one of the LM tests is significant, then the spatial regression model type corresponding to that significant test should be used. For example, if the LM test for spatial lag is significant but the LM test for spatial error is not, then a spatial lag model should be used.
If both the LM test for spatial lag and the LM test for spatial error are significant, then the decision of which spatial regression model type to use is based on the Robust LM values and probabilities. The spatial regression model type with the lower p-value or higher test statistic should be chosen.
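A minimal sketch of these diagnostics on the illustrative OLS model, using spdep’s Lagrange Multiplier tests (named lm.LMtests() in older spdep releases and lm.RStests() in the most recent ones):
library(spdep)
# Simple and robust LM diagnostics for the lag and error alternatives
lm.LMtests(ols_model, listw = queen_w,
           test = c("LMlag", "LMerr", "RLMlag", "RLMerr"),
           zero.policy = TRUE)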
For spatial lag and spatial error regressions, as well as for OLS regression, we face one consequential constraint: all of our predictors are treated as globally stationary, since regardless of the location \(i,\) we apply the same coefficients to estimate a value for our dependent variable. The idea that the modeled relationships are constant throughout space (i.e., the study area) is called spatial stationarity, and it does not always hold. One likely consequence of wrongly assuming spatial stationarity is spatial clustering in the residuals, as we saw in the previous paper, given that nearby neighborhood effects tend to be a more accurate and significant predictor of nearby values. This notion is also consistent with Tobler’s First Law of Geography.
A good way to illustrate the above is Simpson’s Paradox, a phenomenon in statistics in which a trend appears in several groups but disappears or reverses when the groups are aggregated into one. This can happen because, even though some parts of the data might show a very clear trend, the aggregation process can even out many of these effects or even reverse them. Something similar likely happens with spatially constant coefficients and might ultimately lead to spatial clustering in the residuals. One way to avoid these issues altogether is local regression, which runs an OLS regression for each location \(i\) using the points in the data set that are considered “spatially close.” We will later see that we define this notion of spatial closeness through bandwidths. In GWR, data from locations farther away are given less weight, thereby increasing the effect of direct neighbors on the regression estimates for location \(i.\) Therefore, local regression is specific to a location \(i\) and embodies the idea behind Tobler’s First Law of Geography.
The equation for GWR is:
\[ y_i = \sum_{k = 1}^{m} \beta_{ik}x_{ik}+ \epsilon_i \] where \(\beta_{ik}\) is the coefficient of predictor \(k\) specific to location \(i\) (estimated from the observations considered spatially close to \(i\)), \(x_{ik}\) is the value of predictor \(k\) at location \(i,\) and \(\epsilon_i\) is the location-specific residual. GWR has the same assumptions as spatial lag and spatial error regression. Ideally, the data set includes at least 300 observations.
In local regression, the concept of “spatially close” is crucial for determining the weights assigned to neighboring observations when estimating the value at a specific location \(i\). This definition of spatial proximity is typically represented by a bandwidth, which defines the maximum distance or the number of neighboring observations considered when fitting the local regression model.
Two primary approaches exist for defining the bandwidth: fixed and adaptive. A fixed bandwidth specifies a predetermined distance threshold for all locations. Observations falling within this distance are considered neighbors and contribute to the local regression at the target location. Conversely, an adaptive bandwidth dynamically adjusts the distance threshold based on the local density of observations. This approach ensures that a consistent number of neighbors are considered for each location, regardless of the overall spatial distribution of data points.
The choice of bandwidth type, whether fixed or adaptive, can significantly impact the local regression outcome. A fixed bandwidth may lead to overfitting or underfitting in areas with sparse or dense data points, respectively. On the other hand, an adaptive bandwidth can adapt to varying data densities, potentially providing more consistent results across the study area.
In addition to influencing the spatial extent of neighboring observations, the bandwidth also determines the weighting function used in local regression. The weighting function assigns decreasing weights to observations as their distance from the target location increases. The specific form of the weighting function depends on the chosen bandwidth type and the desired smoothness of the local regression estimate.
Ultimately, the selection of an appropriate bandwidth is a crucial aspect of local regression modeling. The choice between fixed and adaptive bandwidths depends on the specific characteristics of the data and the research question being addressed. Careful consideration of the bandwidth’s impact on both the spatial extent of neighbors and the weighting function is essential for obtaining reliable and interpretable local regression results.
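A minimal sketch of a GWR fit in R using the spgwr package (GWmodel is a common alternative); the specification mirrors the illustrative OLS model, and adapt = TRUE requests an adaptive bandwidth expressed as the share of observations used in each local fit:
library(spgwr)
# spgwr works with sp objects; convert the sf layer if necessary
shp_sp <- as(shp, "Spatial")
# Cross-validated selection of an adaptive bandwidth
gwr_bw <- gwr.sel(Log_MEDHVAL ~ PCTBACHMOR + PCTVACANT + PCTSINGLES + Log_NBelPov100,
                  data = shp_sp, adapt = TRUE)
# Fit the geographically weighted regression with the selected bandwidth
gwr_fit <- gwr(Log_MEDHVAL ~ PCTBACHMOR + PCTVACANT + PCTSINGLES + Log_NBelPov100,
               data = shp_sp, adapt = gwr_bw, hatmatrix = TRUE)
gwr_fit  # prints the distribution of the local coefficients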
The fixed bandwidth technique commonly uses a Gaussian weighting function, and one way to define it is
\[ w_{ij} = e^{-0.5\left(\frac{\text{distance}_{ij}}{h}\right)^2}, \] where \(h\) is the bandwidth distance and \(\text{distance}_{ij}\) is the distance between locations \(i\) and \(j.\) A Gaussian distribution makes sense here since we do not discriminate among locations by the number of data points in their neighborhood and therefore apply the same weighting distribution to each location. For a variable (adaptive) bandwidth, the weighting function should be such that its distribution attributes higher weights than the Gaussian to close points, while \(w_{ij}\) decreases rapidly as distance increases. This is important since we do not know in advance the number of close data points we can capture and therefore have to give disproportionately high weights to close neighbors. If we did not do this, we would allow potentially very distant points to have a disproportionately high impact on our prediction.
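As a small illustration, the fixed Gaussian kernel above translates directly into R; the function and object names here are ours and purely illustrative:
# Gaussian kernel weight for a fixed bandwidth h (same formula as above)
gaussian_weight <- function(distance_ij, h) {
  exp(-0.5 * (distance_ij / h)^2)
}
# Weights decay smoothly toward 0 as distance grows relative to the bandwidth
gaussian_weight(distance_ij = c(0, 500, 1000, 2000), h = 1000)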
One of the big limitations of GWR is that, since we work with local subsets of the data, we are more likely to encounter local multicollinearity in our local regressions. Note that local multicollinearity does not imply global multicollinearity. Also, smaller sample sizes make it more difficult to detect a significant relationship between the explanatory variables and the dependent variable. Additionally, explanatory variables in local regression are at risk of insignificance since they are likely to have a very uniform distribution within the area covered by a bandwidth. We also face the multiple testing problem, in that we would need a very large number of tests to determine the significance of the coefficients, which in turn produces many Type I errors even with low thresholds. The multiple testing problem is also why p-values are not reported for GWR coefficients.
The provided histograms illustrate the distribution of all raw variables: MEDHVAL, PCTBACHMOR, PCTVACANT, PCTSINGLES, NBELPOV100, and MEDHHINC. It is crucial to observe that PCTBACHMOR, PCTVACANT, PCTSINGLES, and NBELPOV100 exhibit skewed distributions, indicating the need for a log transformation before conducting additional analyses. Skewness can violate the normality assumptions of subsequent statistical analyses and can distort parameter estimates and hypothesis tests. Log transformation is a common remedy that helps stabilize variances, mitigate the influence of extreme values, and bring the data closer to the assumptions of the chosen statistical methods. Since certain regression techniques yield more valid results when applied to approximately normally distributed data, transforming these variables on a logarithmic scale is recommended to enhance the accuracy and reliability of subsequent analytical procedures.
par(oma=c(0,0,2,0))
par(mfrow=c(2,3))
hist(shp$PCTBACHMOR, breaks = 40, col = "#FF5500",
main = "% of Pop with
a Bachelor Degree or Higher",
xlab = "PCTBACHMOR",
border = "white",
lwd = 0.1)
hist(shp$MEDHHINC, breaks = 50,col = "#FF5500",
main = "Median Household Income",
xlab = "MEDHHINC",
border = "white",
lwd = 0.1)
hist(shp$PCTVACANT, breaks = 40,col = "#FF5500",
main = "% Units Marked as Vacant",
xlab = "PCTVACANT",
border = "white",
lwd = 0.1)
hist(shp$PCTSINGLES, breaks = 40, col = "#FF5500",
main = "% of Units Marked
as Single Family",
xlab = "PCTSINGLES",
border = "white",
lwd = 0.1)
hist(shp$MEDHVAL, breaks = 50,col = "#FF5500",
main = "Median House Value",
xlab = "MEDHVAL",
border = "white",
lwd = 0.1)
hist(shp$NBelPov100, breaks = 50,col = "#FF5500",
main = "# of Households Below Poverty",
xlab = "NBelPov100",
border = "white",
lwd = 0.1)
Before proceeding with the transformation, it is crucial to identify variables with a minimum value of 0: PCTBACHMOR, PCTVACANT, PCTSINGLES, and NBELPOV100 fall into this category. Because the logarithm of 0 is undefined, we add 1 to these variables before applying the transformation. The minimum values are PCTBACHMOR: 0, MEDHHINC: 2499, PCTVACANT: 0, PCTSINGLES: 0, MEDHVAL: 10000, NBELPOV100: 0. Adding 1 to variables with a minimum value of 0 is a standard practice in data preprocessing and prevents undefined results in subsequent analytical procedures.
shp$Log_MEDHHINC <- log(shp$MEDHHINC)
shp$Log_MEDHVAL <- log(shp$MEDHVAL)
#log transformation +1
shp$Log_PCTBACHMOR <- log(shp$PCTBACHMOR + 1)
shp$Log_PCTVACANT <- log(shp$PCTVACANT + 1)
shp$Log_PCTSINGLES <- log(shp$PCTSINGLES + 1)
shp$Log_NBelPov100<- log(shp$NBelPov100 + 1)
par(oma=c(0,0,2,0))
par(mfrow=c(2,3))
hist(shp$Log_PCTBACHMOR, breaks = 40, col = "#FF5500",
main = "Log Trans. of % Pop
with Bach. Degree or Higher",
xlab = "LNPCTBACHMOR",
border = "white",
lwd = 0.1)
hist(shp$Log_MEDHHINC, breaks = 50,col = "#FF5500",
main = "Log Trans. of
Median Household Income",
xlab = "LNMEDHHINC",
border = "white",
lwd = 0.1)
hist(shp$Log_PCTVACANT, breaks = 40,col = "#FF5500",
main = "Log Trans. of
% Units marked as Vacant",
xlab = "LNPCTVACANT",
border = "white",
lwd = 0.1)
hist(shp$Log_PCTSINGLES, breaks = 40, col = "#FF5500",
main = "Log Trans. of % Units
Marked as Single Family",
xlab = "LNPCTSINGLES",
border = "white",
lwd = 0.1)
hist(shp$Log_MEDHVAL, breaks = 50,col = "#FF5500",
main = "Log Trans. of the
Median House Value",
xlab = "LNMEDHVAL",
border = "white",
lwd = 0.1)
hist(shp$Log_NBelPov100, breaks = 50,col = "#FF5500",
main = "Log Trans. of the
# of HH Below Poverty",
xlab = "LNNBelPov100",
border = "white",
lwd = 0.1)
The premise of spatial autocorrelation centers around Tobler’s First Law of Geography. Articulated by Waldo Tobler in 1969, the law posits that “everything is related to everything else, but near things are more related than distant things.” It suggests that objects or phenomena that are geographically close to one another are more likely to be similar, or to have a spatial relationship, compared to objects that are farther apart. The first law of geography is the foundation of the fundamental concepts of spatial dependence and spatial autocorrelation; it is utilized specifically in the inverse distance weighting method of spatial interpolation and supports the regionalized variable theory behind kriging. Ultimately, Tobler’s law is a fundamental assumption applied in all spatial analysis.
Therefore, we begin by defining neighbors for each of the block groups in Philadelphia. A polygon (in this case, block group) is considered to have a “Queen Neighbor Relationship” to another polygon if they share either a side and/or a vertex. This is compared to a “Rook Neighbor Relationship,” where two block groups only share a side. Neighbors based on contiguity are constructed by assuming that neighbors of a given area are other areas that share a common boundary. Neighbors can be of type Queen if a single shared boundary point meets the contiguity condition, or Rook if more than one shared point is required to meet the contiguity condition. We’ll prepare a weight matrix based on Queen neighbors, and visualize the connections between neighbors based on the weight matrix.
# Producing Queen Neighbor Relationship Outputs
queen<-poly2nb(shp, row.names=shp$POLY_ID)
summary(queen)
## Neighbour list object:
## Number of regions: 1720
## Number of nonzero links: 10526
## Percentage nonzero weights: 0.3558004
## Average number of links: 6.119767
## Link number distribution:
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 18 27
## 4 16 52 175 348 493 344 177 62 28 10 4 2 1 1 1 1 1
## 4 least connected regions:
## 441 708 1391 1665 with 1 link
## 1 most connected region:
## 1636 with 27 links
# Visualize Connections Between Neighbors Based On Weight Matrix
reg_data_lines = nb2lines(queen,
coords = st_centroid(shp$geometry),
as_sf = TRUE)
tm_shape(shp) +
tm_borders(col = 'grey80', lwd = 0.5) +
tm_shape(shp) +
tm_dots() +
tm_shape(reg_data_lines) +
tm_lines(col = '#FF5800') +
tm_layout(frame = FALSE)