


Notes from Lecture and Various Papers 

Instrumental Variables

Instrumental variables are used when OLS estimates are biased by endogeneity or measurement error. The process is based upon identifying exogenous variation in the key independent variable.


I’m not going to go into how the IV estimator is constructed as it is well documented in EC406 notes, or see e.g. Stock and Watson.


If the regression is overidentified (i.e. there are more instruments than endogenous regressors) then a Hansen-Sargan test can be used to test the exclusion restriction – although the instruments will pass the test if they are all equally endogenous, i.e. it is a weak test. In general the F-stat should be > 10 in the first stage, and there should be strong theoretical reasoning behind the instrument (such that the “compliers” are meaningfully identified).
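To make the mechanics concrete, here is a minimal sketch of two-stage least squares and the first-stage F-test on simulated data (pure numpy; all variable names and parameter values are illustrative, not from any paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
u = rng.normal(size=n)                       # unobserved confounder
z = rng.normal(size=n)                       # instrument: relevant, excluded
x = 0.8 * z + u + rng.normal(size=n)         # endogenous regressor
y = 2.0 * x + 3.0 * u + rng.normal(size=n)   # true effect of x is 2

def ols(X, y):
    """OLS coefficients via least squares."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

Z = np.column_stack([np.ones(n), z])

# First stage: regress x on the instrument; F-test on the excluded instrument
b1 = ols(Z, x)
resid = x - Z @ b1
rss_u = resid @ resid
rss_r = ((x - x.mean()) ** 2).sum()          # restricted: intercept only
F = (rss_r - rss_u) / (rss_u / (n - 2))
print(f"first-stage F on instrument: {F:.1f}")   # should exceed 10

# Second stage: regress y on the fitted values of x
xhat = Z @ b1
b2 = ols(np.column_stack([np.ones(n), xhat]), y)
b_ols = ols(np.column_stack([np.ones(n), x]), y)
print(f"2SLS estimate: {b2[1]:.2f}")   # close to 2
print(f"OLS estimate:  {b_ols[1]:.2f}")  # biased upward by the confounder
```

The OLS estimate absorbs the confounder's effect, while the 2SLS estimate uses only the variation in x driven by the instrument.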


In the spatial context, spatially lagged X variables have been used as instruments for the spatial lag of Y. However, as we have already seen, this method is not without its complications (difficulty in correctly specifying the functional form; possible violation of the exogeneity restriction). Thus, the literature has begun to move toward adopting the quasi-experimental method by searching for instruments based on policy changes, boundaries, geological features or other similar events.


Some examples


Hoxby, Does Competition among Public Schools Benefit Students and Taxpayers? (2000)

This is surely one of the most famous examples of a spatial IV. The paper examines whether increased school competition in the form of a greater number of school districts within a municipality has benefits for the population studied. OLS estimates are biased because the supply of school districts is in part a response to the demand for school districts, which is probably driven by wealth, ability, parental involvement, and other unobservable characteristics which codetermine student outcomes and cannot be readily controlled for. Thus, Hoxby uses an instrument to attempt to isolate exogenous variation in the supply of school districts, to get a consistent estimate of the effect competition has on student outcomes. The instrument is based on the number of streams and rivers within a municipality. The logic is that in the 19th century when school districts were being drawn up, geological features such as streams presented barriers to movement, such that districts were often drawn up with the streams forming natural boundaries. Thus, a municipality with more streams would have more school districts, hence the instrument is relevant. Over time, the importance of streams in terms of determining outcomes has diminished, so the presence of more streams has no effect on educational outcomes other than through its effect on determining school districts in the 19th century, and hence the exclusion restriction is satisfied.


There are problems with the strategy. Specifically, rivers may still have an economic effect today, and this could feed back into educational outcomes. Additionally, the way the instrument was constructed has been criticized, as it involved a great deal of subjective judgement.


Luechinger, Valuing Air Quality Using the Life Satisfaction Approach (2009)

This paper is trying to gauge how important air quality is for affected populations. The hedonic method of valuation (which seeks to determine the unobserved price of a public good by using prices embedded in private goods) tends to underestimate the value of air quality as migration is costly, and private goods prices are based on perceived rather than objective risk. Any residual effect that air pollution has on life satisfaction is indication that compensation has not been fully capitalized in house prices for the reasons just stated.


However, an OLS estimate of air quality on life satisfaction would be biased, as cleaner air is the product not only of exogenous policy change (even assuming it is exogenous), but also of local industrial decline and economic downturn. These simultaneous developments can have a countervailing effect on life satisfaction and housing rents. Thus he uses an instrument for SO2 levels: the mandated installation of scrubbers at power plants.


The construction of the instrument is somewhat convoluted as it relies upon a difference-in-differences estimation. Desulphurization pursuant to retroactive fitting of scrubbers at power plants is the treatment, with a county being downwind or upwind of the power plant determining assignment to the treatment and control group respectively. Yet, as being in treatment/control is a question of degree rather than kind, the treatment group variable is a frequency measure of how often in the period of study the county in question is downwind of the plant. This is likewise multiplied by a distance decay function, and the pre-desulphurization emission levels of the plant in question are controlled for.


The main finding is that SO2 concentration does negatively affect life satisfaction, with estimates being much larger for the OLS specification indicating that reductions in sulphur levels are indeed accompanied by factors that have a countervailing effect on satisfaction.


Gibbons et al. Choice, Competition and Pupil Achievement (2008)

This paper uses a boundary discontinuity in order to construct an instrument for primary school competition in the UK which gets around the endogeneity concern in OLS estimates, namely that motivated parents may move closer to popular schools. The boundaries in question are the Local Education Authority (LEA) boundaries. Whilst families are allowed to apply to schools outside of their LEA, cross-LEA attendance is extremely uncommon.


They construct indices for choice: for each school they define a travel-to-school zone that encompasses all residential addresses that are a) within the same LEA and b) contained within a circle whose radius is the median travel-to-school distance for the pupils at that school. Pupil choice is thus the number of travel-to-school zones in which the student lives, and the school competition measure is the average of this value for the students actually attending a given school (i.e. the number of alternatives available to students of a particular school). If families sort spatially near to high performing schools this will tend to decrease apparent competitiveness.


They then exploit the fact that families living near boundaries face longer journeys to school than those in the interior, and as such they are more likely to attend their local school. This is because the catchment area is bounded and hence shrinks. Thus the distance between a pupil’s home and the LEA boundary is an instrument for school choice, and the distance between a school and the boundary is an instrument for competitiveness. They do not find evidence that school competition increases pupil achievement.


Differencing Methods

Often there will be spatial sorting and heterogeneity, i.e. differences between places that lead to biased estimates. This sorting will often be on observable characteristics, but just as frequently on unobserved ones.


One method for dealing with this is the fixed effects model. This can be estimated with panel or cross sectional data using area dummies, or by making the within groups transformation (de-meaning) and then estimating with OLS. This removes the area specific time invariant determinants of the dependent variable.


With panel data, the data can also be time-differenced, which has the same effect. Time dummies can also be included to strip out variation common across regions due to time trends. The remaining variation is time varying, region specific variation, and for the estimates to be unbiased the region specific time varying shocks must be uncorrelated with the regressors. For example, there must be no sudden shock to the educational system in a given area that induces people to sort spatially into that area.
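As a toy illustration of the within-groups transformation, the following sketch (simulated balanced panel; all names and values illustrative) de-means by region and by year and recovers the true coefficient, which pooled OLS would miss because x is correlated with the region fixed effect:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
regions, periods = 50, 8
idx = pd.MultiIndex.from_product([range(regions), range(periods)],
                                 names=["region", "year"])
df = pd.DataFrame(index=idx).reset_index()

alpha = rng.normal(size=regions)[df["region"]]   # time-invariant region effect
tau = rng.normal(size=periods)[df["year"]]       # common time shock
df["x"] = 0.5 * alpha + rng.normal(size=len(df)) # x correlated with the FE
df["y"] = 1.5 * df["x"] + alpha + tau + rng.normal(size=len(df))

# Two-way within transformation: de-mean by region and by year
for col in ["x", "y"]:
    df[col + "_w"] = (df[col]
                      - df.groupby("region")[col].transform("mean")
                      - df.groupby("year")[col].transform("mean")
                      + df[col].mean())

beta = np.polyfit(df["x_w"], df["y_w"], 1)[0]
print(f"within estimate: {beta:.2f}")   # close to the true 1.5
```

The double de-meaning is equivalent to including region and year dummies in OLS for a balanced panel.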


The difference in difference method is usually applied to evaluating policy interventions where a treatment and control can be created. I am not going to go into the mechanics here as it is well documented elsewhere.


Some Examples


Machin et al. Resources and Standards in Urban Schools (2007)

The paper is concerned with whether additional resources can be used to improve the outcomes of hard to reach pupils, specifically evaluating the Excellence in Cities (EiC) programme that gave extra funding to schools based upon their level of disadvantage, as measured by the proportion of pupils eligible for a free lunch. They use a DID strategy comparing the outcomes in EiC schools with a comparison group. A direct comparison between EiC and non-EiC schools would not be valid as there is no reason to assume that the parallel trends assumption holds. Mindful of this, the authors use propensity scores based on a host of school and pupil level characteristics to create a subset of non-EiC schools which are statistically similar to the pre-treatment EiC schools, and they use this subset as the control group. They do not make a hugely convincing argument for this method, and indeed there are statistically significant differences in the outcome measures in the pre-treatment periods, indicating that there is only limited reason to suspect that the key identifying assumption holds.


They find that the policy was effective in raising pupil attainment in the treatment schools but that the benefits were restricted to the students best able to take advantage of the policy (i.e. the most gifted).


Duranton et al. Assessing the Effects of Local Taxation Using Microgeographic Data (2011)

This is an interesting paper that seeks to identify the effect of local property taxation on the growth of firms. Estimating this has been difficult as site characteristics are heterogeneous, and many characteristics will be correlated with unobservable determinants. Secondly, firms are heterogeneous, and these differences are often largely unobservable, yet they cause firms to sort spatially. Lastly, tax systems may be endogenous to the location decisions of firms.


Using panel data they estimate a model which includes firm specific observable characteristics, removing the firm specific time varying observable variation. They include a firm fixed effect to remove the time invariant firm specific unobservable variation, as well as higher level fixed effects (site and region). They then difference the data in the usual way, which implements the fixed effects strategy as noted above.


They then take a spatial difference: the difference (in the difference) between each establishment and every other establishment located at a distance less than d from it. If there is a term αzt for each site z in time t, and this is not controlled for, then any local shock to firms that also affects tax rates will bias the panel estimates above. However, if we are able to assume that for small d, Δαzt ≈ 0 (i.e. local shocks are smooth over small amounts of space), then by spatially differencing the alpha term falls away, and the time varying local shocks are effectively controlled for.
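A rough numerical sketch of the idea (simulated data, not the authors' actual specification): tax rates are correlated with a smooth local shock, which biases the naive estimate, while differencing each establishment against its near neighbours removes the shock because it is approximately constant over short distances:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 400
xy = rng.uniform(0, 10, size=(n, 2))        # establishment locations

# Smooth local shock: varies over space but is ≈ constant at short range
alpha = np.sin(xy[:, 0]) + np.cos(xy[:, 1])
tax = 0.5 * alpha + rng.normal(size=n)      # tax correlated with local shock
growth = -1.0 * tax + alpha + 0.1 * rng.normal(size=n)  # true effect is -1

# Naive OLS: biased because alpha is omitted
b_naive = np.polyfit(tax, growth, 1)[0]

# Spatial difference: subtract every neighbour within distance d
d = 0.5
dist = np.linalg.norm(xy[:, None, :] - xy[None, :, :], axis=2)
i, j = np.where((dist < d) & (dist > 0))
dt = tax[i] - tax[j]
dg = growth[i] - growth[j]
b_sd = np.polyfit(dt, dg, 1)[0]
print(f"naive: {b_naive:.2f}, spatially differenced: {b_sd:.2f}")
```

The spatially differenced estimate sits much closer to the true -1 because Δα is close to zero within a 0.5-unit radius.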


They then combine this with an instrumentation strategy that instruments tax rates using political variables.  





Notes from Lecture 


Firms and individuals have choices over discrete alternatives, such as which mode of transport to take, or where to locate their businesses. These choices are modeled using the random utility model in order to aid economic interpretation of those choices.

Random Utility Model

This was developed by Daniel McFadden and underlies the discrete choice model. This model holds that preferences over alternatives are a function of biological taste templates, experiences and other personal characteristics some of which are observable, others of which are not (cultural tastes etc.), and the function is heterogeneous within a given population. This indicates that an individual/firm’s utility from choice j can be decomposed into two components:

Uij = Vij – εij 

where V is an element common to everyone given the same characteristics and constraints. This might include representative tastes of the population such as the effects of time and cost on travel mode choices. ε is a random error that reflects the idiosyncratic tastes of the individual concerned as well as the unobserved attributes of the choice j.

V is observable based on consumer/firm choice characteristics such that:

Vij = αtij + βpij + δzij

where t is time and p is price and z is other observable characteristics.

In a setting where there are two choices (e.g. car or bus to work) we observe whether an individual chooses car (yi = 0 ) or bus (yi = 1). Assuming that individuals maximize their utility, they will choose bus if this exceeds the utility from going by car Ui1 > Ui0 which means that Vi1– εi1 > Vi0 – εi0 which indicates that εi1 – εi0 < Vi1 – Vi0. Therefore the probability that we see an individual choose to go by bus is:

P(εi1 – εi0 < Vi1 – Vi0)

which is equal to P(εi1 – εi0 < α(Ti1 – Ti0) + β(Pi1 – Pi0))

If we are willing to assume that the probability depends linearly on the observed characteristics then this can be estimated by running the following OLS regression:

Yi = α(Ti1 – Ti0) + β(Pi1 – Pi0) + ui

At this point further observable characteristics can be added, z.

However, as is well known, the OLS model is not bounded by 0 and 1, whereas probability functions are. This means that this estimation may return results outside the possible range of probabilities. In order to counter this problem we can estimate a probability function using probit or logit estimators, which are calculated using the maximum likelihood method [about which I am not going to write anything – assuming it will not be examined in detail].
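A small simulated example (purely illustrative) shows both points: linear probability fitted values can stray outside [0, 1], while a logit fit by maximum likelihood (Newton-Raphson here) recovers the underlying coefficients:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2000
dt = rng.normal(size=n)     # time difference (bus minus car), illustrative
dp = rng.normal(size=n)     # price difference, illustrative
X = np.column_stack([np.ones(n), dt, dp])
true_b = np.array([0.2, -1.0, -0.5])

# Logit data-generating process: choose bus with probability Λ(Xb)
p = 1 / (1 + np.exp(-X @ true_b))
y = (rng.uniform(size=n) < p).astype(float)

# Linear probability model: fitted values can leave [0, 1]
b_lpm = np.linalg.lstsq(X, y, rcond=None)[0]
fitted = X @ b_lpm
print("LPM fitted range:", fitted.min(), fitted.max())

# Logit by Newton-Raphson on the log-likelihood
b = np.zeros(3)
for _ in range(25):
    mu = 1 / (1 + np.exp(-X @ b))
    W = mu * (1 - mu)                       # logit weights
    b += np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (y - mu))
print("logit estimates:", b.round(2))   # close to the true coefficients
```

In practice one would use a library estimator; the hand-rolled Newton step is just to show what maximum likelihood is doing.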

The McFadden paper deals with car versus bus commuting in the SF Bay area.

Multiple Choices

Often we want to think about more than one choice, which requires us to extend this model. We can extend the random utility model to many choices Uij = Vij + εij. Now an actor will choose alternative k if the utility derived from this choice is higher than for all other choices:

Vik + εik > Vij + εij for all j≠k 

If we assume an extreme value distribution then the solution for the probability choice is given by P(yi = k) = exp(Vik) / ∑ exp(Vij). This is a generalization of the logit model with many alternatives, hence the name “multinomial logit”. The model compares choices to some predetermined base case.

Independence of Irrelevant Alternatives (IIA)

One drawback of the multinomial logit method is the IIA problem. This is driven by the assumption underlying the model that if one choice is eliminated in time t=1, the ratio of individuals choosing the remaining options must remain constant from the pre-elimination period t=0. For example, suppose in t=0 40 people take bus A, 12 people take bus B and 20 people drive, and then in t=1 the bus B company goes bust. In t=0, the ratio of people taking bus A relative to those driving is 2:1. This must remain constant in t=1, so the model assumes that 24 people will drive and 48 will take bus A. This might not be a valid assumption if bus seats are not supplied elastically, or if bus A and bus B were not close substitutes.

It is simple to see why this is the case: the underlying assumption of the model is that P(yi = k) = exp(Vik) / ∑ exp(Vij), so the ratio of any two choice probabilities clearly cannot change simply because one of the other alternatives has been eliminated.
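This is easy to verify numerically. The sketch below (illustrative utilities) computes multinomial logit probabilities and shows that the ratio between two alternatives is unchanged when a third is dropped:

```python
import numpy as np

def mnl_probs(V):
    """Multinomial logit choice probabilities P(k) = exp(Vk) / sum_j exp(Vj)."""
    e = np.exp(V - V.max())      # subtract max for numerical stability
    return e / e.sum()

# Illustrative systematic utilities for bus A, bus B, car
V = np.array([1.2, 0.0, 0.5])
p3 = mnl_probs(V)
print("three alternatives:", p3.round(3))

# Eliminate bus B: the ratio P(bus A)/P(car) is unchanged — that is IIA
p2 = mnl_probs(V[[0, 2]])
print("ratio before:", p3[0] / p3[2], "after:", p2[0] / p2[1])
```

Both ratios equal exp(V_busA – V_car), regardless of which other alternatives are present.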

This can be solved using the nested logit model. Conceptually this decomposes the choices into two separate stages. In the first stage the individual chooses whether to take his car or public transport. If he decides on public transport then he must decide between bus A and bus B. This choice structure is estimated using sequential logits whereby the value placed on the alternatives in the second stage are entered into the choice probabilities in the first stage.

Aggregate Choice Models

Aggregate choice models are useful when individual data are not available, and also when computing power is an issue (due to many fewer observations). All of the above models have aggregate equivalents. In fact, using the Poisson model with a max likelihood estimation method, aggregated data give exactly the same coefficient estimates as the conditional logit model when the only data available are the choice characteristics (i.e. how many people chose what). Multinomial logit will be better when there are accompanying individual/group-level characteristics.

Gravity Models

Choices can also be modeled as flows between origins and destinations. This is widely applied in the fields of trade, migration and commuting.  A flow from place j to k can be modeled as:

Ln(njk) = βXjk + αj + αk + εjk

where the alphas represent characteristics of the source and destination such as population, wages etc., a cost of moving measure can also be included. This literature has found strong distance decay effects, which are puzzling in many cases (e.g. trade) as the cost of moving goods further is now fairly marginal.

Discrete v Aggregate: discrete choice models have the advantage that firm level characteristics can be incorporated, and there is a strong theoretical model underlying the estimations. Aggregate flows on the other hand are easier to compute and there is no need to make assumptions about the functional form that are necessary for the non-linear maximum likelihood estimators. One disadvantage is that no separation of the individual/aggregate factors is possible.



Notes from lecture and various articles 


Generally there is very little reason to suppose that a process will be generated randomly over space. Spatial statistics help us to gauge to what extent the values that data take are related to other observations in the vicinity. 

Spatial statistics broadly fall into two categories:

1)     Global – these allow us to evaluate if there are spatial patterns in the data (clusters)

2)     Local – these allow us to evaluate where these spatial patterns are generated

Differences between these two statistics can be summarized thus:



Global                              Local
Single valued                       Multi-valued
Assumed invariant over space        Variant over space
Non-mappable                        Mappable
Used to search for regularities     Used to search for irregularities
Aspatial                            Spatial

Generally these statistics are based upon:

  1. Local means – see spatial weighting sections above (smoothing techniques such as kernel regression and interpolation).
  2. Covariance methods – comparing the covariances of neighbourhood variables (Moran’s I, and LISA)
  3. Density methods – the closeness of data points (Ripley’s K, Duranton & Overman’s K-density).

Moran’s I

This is one of the most frequently encountered measures of global association. It is based on the covariance between deviations from the global mean between a data point and its neighbours (howsoever defined – e.g. queen’s/rook’s contiguity at the first/second order etc.).

It is computed in the following way:

I = [n / ∑i∑j Wij] × [∑i∑j Wij (yi – Yg)(yj – Yg) / ∑i (yi – Yg)2]

where there are n data values, y is the outcome variable at location i or its neighbour j, the global mean is Yg, and the proximity between locations i and j is given by the weights Wij.

A z statistic can be calculated in order to assess the significance of the Moran's I estimate (compared in the usual way to a critical value, e.g. 1.96 for 5% significance).

Problems with this measure are that it assumes constant variation over space. This may mask a significant amount of heterogeneity in spatial patterns, and it does not allow for local instability of variation. Thus a focus on local patterns of spatial association may be more appropriate. This could involve a decomposition of this type of global indicator into the contributions of each individual observation. One further issue is that the problems associated with MAUP (see above summaries) are built into the Moran statistic.
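For concreteness, here is a minimal implementation of the global statistic on a simulated 1-D lattice (purely illustrative; real applications would use queen/rook contiguity on a map):

```python
import numpy as np

def morans_i(y, W):
    """Global Moran's I with binary (or row-standardised) weight matrix W."""
    z = y - y.mean()
    return len(y) / W.sum() * (z @ W @ z) / (z @ z)

# Toy 1-D lattice: each location's neighbours are the adjacent cells
n = 100
W = np.zeros((n, n))
for i in range(n - 1):
    W[i, i + 1] = W[i + 1, i] = 1

rng = np.random.default_rng(7)
smooth = np.cumsum(rng.normal(size=n))   # spatially autocorrelated series
noise = rng.normal(size=n)               # spatially random series
i_smooth = morans_i(smooth, W)
i_noise = morans_i(noise, W)
print(f"clustered: {i_smooth:.2f}, random: {i_noise:.2f}")
```

The clustered series returns a value near +1, the random series a value near the null expectation of -1/(n-1) ≈ 0.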

Local Moran

The Local Moran is a Local Indicator of Spatial Association (LISA) as defined by Anselin (1995). He posits two requirements for a statistic to be considered a LISA:

  1. The LISA for each observation gives an indication of the extent of spatial clustering of similar values around that observation.
  2. The sum of the LISAs for all observations is proportional to a global indicator of spatial association.

The local Moran statistic allows us to identify locations where clustering is significant. It may turn out to be similar to the global statistic, but it is equally possible that the local pattern is an aberration in which case the global statistic would not have identified it.

It is calculated like this:

Ii = zi ∑j Wij zj,  j ≠ i


where z are the deviations of observation i or j from the global mean, and w is the weighting system. If I is positive then the location in question has similarly high (low) values as its neighbours, thus forming a cluster.

This statistic can be plotted on the y axis, with the individual observation on the x axis, to investigate outliers, and see whether there is dispersion or clustering.


There are problems with this measure. Firstly, the local Moran will be correlated between two locations as they share common elements (neighbours). Due to this, the usual interpretation of significance will be flawed, hence the need for a Bonferroni correction to adjust the significance values (thus reducing the probability of a type I error – wrongly rejecting the null of no clustering). MAUP is similarly an issue, as above.
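A toy implementation of the local statistic (simulated 1-D data; no significance correction applied) shows how it picks out a planted cluster:

```python
import numpy as np

def local_moran(y, W):
    """Local Moran I_i = z_i * sum_j w_ij z_j, one value per location."""
    z = (y - y.mean()) / y.std()
    return z * (W @ z)

# Same toy lattice as for the global statistic: adjacent cells are neighbours
n = 100
W = np.zeros((n, n))
for i in range(n - 1):
    W[i, i + 1] = W[i + 1, i] = 1

rng = np.random.default_rng(8)
y = rng.normal(size=n)
y[40:50] += 5            # plant a local cluster of high values
I = local_moran(y, W)
print("hotspot locations:", np.argsort(I)[-5:])   # concentrated around 40-49
```

The largest I values land inside the planted cluster, which is exactly the kind of local pattern a global statistic can miss.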

Point Pattern Analysis

This type of analysis looks for patterns in the location of events. This is related to the above techniques, although they are based on aggregated data of which points are the underlying observations. As the analysis is based on disaggregated points, there is no concern about MAUP driving the results.

Ripley’s K

This method counts a firm or other observation’s number of neighbours within a given distance and calculates the average number of neighbours of every firm at every distance – thus a single statistic is calculated for each specified distance. The benchmark test is to look for CSR (complete spatial randomness), which states that observations are located in any place with the same constant probability, and they are so located independently of the location of other observations. This implies a homogeneous expected density of points in every part of the territory under examination.

Essentially a circle of given distance (bandwidth) is centred on an observation, and the K statistic is calculated based on all other points that are located within that circle using the following formula:

K(d) = (α/n2) ∑i ∑j≠i I{distanceij < d}

where α is the area of the study zone, n is the number of points, and I is an indicator counting the pairs that satisfy the Euclidean distance restriction. If there is an average density of points µ, then the expected number of points in a circle of radius r is µπr2. As the K statistic is the average number of neighbours divided by the average density µ, this means that CSR leads to K(r) = πr2.

Again, the returned density by distance can be plotted against the uniform distribution to see whether observations are clustered or dispersed relative to CSR.
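A minimal implementation (simulated data; no boundary correction, so the CSR value comes out a little below πd²):

```python
import numpy as np

def ripley_k(xy, d, area):
    """K(d) = (area / n^2) * number of ordered pairs closer than d."""
    n = len(xy)
    dist = np.linalg.norm(xy[:, None, :] - xy[None, :, :], axis=2)
    pairs = ((dist < d) & (dist > 0)).sum()
    return area / n**2 * pairs

rng = np.random.default_rng(9)
area = 100.0                                     # 10 x 10 study zone
csr = rng.uniform(0, 10, size=(500, 2))          # complete spatial randomness
clustered = rng.normal(5, 0.5, size=(500, 2))    # points packed near centre

d = 1.0
k_csr = ripley_k(csr, d, area)
k_clu = ripley_k(clustered, d, area)
print(f"CSR: {k_csr:.2f} (benchmark pi*d^2 = {np.pi * d**2:.2f}), "
      f"clustered: {k_clu:.1f}")
```

The clustered pattern returns a K far above the πd² benchmark, while the CSR pattern sits near it (slightly below, illustrating the boundary underestimation discussed above).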

Marcon and Puech (2003) outline some issues with this measure. Firstly, since the distribution of K is unknown, the variance cannot be evaluated, which necessitates using the Monte Carlo simulation method for constructing confidence intervals. Secondly there are issues at the boundaries of the area studied, as part of the circle will fall outside the boundary (and hence be empty) which may lead to an underestimation at that point. This can be partially corrected for by using only the part of the circle’s area that is under study.

Additionally, CSR is not always a particularly useful null hypothesis; other benchmarks may be preferable.

Kernel Density

These measures yield local estimates of intensity at a specified point in the study. The most basic form centres a circle on the data point, calculates the number of points in the area and divides by the area of the circle. i.e:

δ(s) = N(C(s, r)) / πr2

where s is the individual observation, N is the number of points within a circle of radius r. The problem with this estimate is that the r is arbitrary, but more seriously, small movements of the circle will cause data points to jump in and out of the estimate which can create discontinuities. One way to improve on this therefore is to specify some weighting scheme where points closer to the centroid contribute more to the calculation than those further away. This type of estimation is called the kernel intensity estimate:

δ(s) = ∑i=1n (1/h2) K((s – si)/h)


where h is the bandwidth (a wider bandwidth reduces variance but introduces bias) and K is the kernel weighting function.
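A small sketch of a kernel intensity estimate (simulated event locations; the Gaussian kernel and bandwidth choice are illustrative):

```python
import numpy as np

def kernel_intensity(s, points, h):
    """Gaussian-kernel intensity estimate at location s with bandwidth h."""
    d2 = ((points - s) ** 2).sum(axis=1)
    # 2-D Gaussian kernel: weights decay smoothly with distance, so points
    # no longer jump discontinuously in and out of the estimate
    k = np.exp(-d2 / (2 * h**2)) / (2 * np.pi * h**2)
    return k.sum()

rng = np.random.default_rng(10)
points = rng.normal(5, 1.0, size=(1000, 2))   # events clustered around (5, 5)

centre = kernel_intensity(np.array([5.0, 5.0]), points, h=0.5)
edge = kernel_intensity(np.array([9.0, 9.0]), points, h=0.5)
print(f"at cluster centre: {centre:.1f}, far away: {edge:.4f}")
```

The estimated intensity is high at the centre of the simulated cluster and falls smoothly toward zero away from it, unlike the naive circle count.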