Return to Courses Listing Return to PAV Home Page Return to 781 Home Page Email to PAV 
City and Regional Planning 871.05The files here are intended for the use of students in C&RP 871.05 at OSU. Link to Restricted SiteHere . Note that you will need to log in with your KSA login ID and password (same as for the KSA network).
GeneralSome general information on the R system, with some usability tips
1 Getting RSome instructions for setting up the R system (mostly for Windows). If you use them and have problems, please let me know, so I can revise them Exercise 1.1 Download and set up the R system according to the instructions Exercise 1.2 Download the R script spatialpackages.zip . Unzip the included script spatialpackages.R, open R and run this script while you are online, to install several spatial packages.
2 Importing DataSome instructions Exercise 2.1 Here is a link to a csv file containing data on crimes in the US in 1999, by state; and here is a page describing what is in the data. Import the data into R, and examine it. Exercise 2.2 Save the data in R’s internal form, exit R, and then import (load) the data you saved. Examine the new data: it should be the same as the original one. You can reimport the csv file by assigning it to a different data frame (remember that when you load a dataset, you have no say in the names assigned: they are just the ones existing when you saved the original data). Can you figure out how to check that (say) the variable area is the same in the two data frames (ie the one you loaded and the one you reimported via read.csv)?
3 Working with Data FramesSome information . You may also want to look at the online books for more on data frames. Exercise 3.1 Reload (or reimport) the 1999 crimes dataset. The data in the final column (last variable) in the dataset is shall_rap_PA . Figure out how to delete it from the data frame. Exercise 3.2 You will notice that the states’ names are in antialphabetical order. Figure out how to reverse this. Hint: look at the documentation for order. Which variable do you think you want to order? You may also want to read up on methods of indexing data frames Exercise 3.3 The name of the first variable is fipsstaq, when it should be fipsstat. Figure out how to correct this in R. Hints: you could use R’s internal data editor. But you could also do it via commands, or via a series of statements in a script: extract the variable names into some variable temp, edit the first entry in temp, and then overwrite the variable names in the data frame with temp. Exercise 3.4 Create a new variable in the data frame which is an indicator (0/1 variable) for the states of Michigan, Ohio and Indiana. Hint: you probably don’t want to do this using the state variable, since it is prone to typos. What else could you look at? Exercise 3.5 Using the variable you created in exercise 3.4, create a new data frame called (say) MOI, containing just the data for the three states. Use the data editor to verify that you’ve done this correctly; then delete MOI. Exercise 3.6 Instead of looking at crime rates per 100,000 population, it may be preferable to base analyses on the logarithms of the rates. Figure out how to create new variables, say lvio for the log of vio, and add them to the data frame. If you’re into programming, you may want to think about how you could write a simple looping function to do this; but of course in the present case it’s easy to do “by hand”. Exercise 3.7 verify that if you try to refer to the variable state (without qualification) you get an error. Now attach the data frame, and verify that a reference to state works. What happens if you now define state<3.7 ? Do you think you have lost the original state forever?
4 Basic RegressionThe command to do linear regression in R is lm (for Linear Model): this command is loaded automatically when the system stares (in the library stats), so it is always available to you in your R session. Exercise 4.1 Specifying what goes into a regression (the lhs and rhs elements) is done by a special structure called a formula. Use R’s help facility to read up on formula. You may also want to look at what some of the other reference material on the restricted sites says about formulae. Exercise 4.2 With respect to the crimes data, the research question of interest is whether a particular crime type is affected by whether the state has passed a shallissue law. (Intuitively, it could go either way: if there is such a law, criminals could reason that the law has increased the chances that their victim will be armed, and this could deter them from committing crimes; on the other hand, if everyone is packing a gun, then there may be more murders in particular, as the contemporary US reverts to something like the old Wild West). Run some regressions using the crimes data to see if the shallissue variable is significant or not. You probably want to use the v_shall variable, since shalll (for example) contains missing observations. Exercise 4.3 If you do not specify that you do not want an intercept, one is provided automatically. How would you (a) specify that you do not want an intercept; and (b) explicitly ask the regression function to provide one for you? Exercise 4.4 With respect to your regression results, investigate the plotting options open to you. Exercise 4.5 Most researchers believe that the appropriate dependent variable is the logarithm of the relevant crime rate. Of course you could just create additional variables in your data frame, but as it happens this is not necessary. How can you avoid creating additional variables explicitly? Exercise 4.6 if you want variables v1 and v2 included as independent variables in your regression, you would include them via v1+v2 as part of the regression formula. What if you wanted to include the sum of v1 and v2 (as one new variable) and you didn’t want to create a new variable explicitly?
5 Spatial Data InputSome notes . Warning: these notes are very much ”under construction”. Spatial data structures are among the more complicated objects I’ve had to deal with in R, and I’m not sure I understand all the ins and outs. Take these notes for what they are worth. The best reference (with many examples) is Bivand et al, chapters 2 and 4. Exercise 5.1 The file s_01de10.zip contains a shapefile for the US states and its territories. Retrieve the zip file, unzip it to a convenient place, and then use R’s readOGR function to read it in. Call the resulting data frame geostates Exercise 5.2 Check that geostates is in fact a spatial polygons data frame Exercise 5.3 Plot what you have found: try the plot method. Because of the wide range of coordinates, the plot won’t be terribly helpful, but the outline of the US should be discernible. Exercise 5.4 Display the row.names of geostates. You will see that one element is NA (meaning: missing). This will hamper your ability to do anything useful with the data. Delete this row from the data frame. You may want to make a copy of geostates first, then work on the copy and assign back to geostates when you’re sure everything has worked correctly. This is also a situation in which it might be advisable to record your commands in a script, and then execute them linebyline until you’ve got it right. Exercise 5.5 Verify that the number of rows of the data in geostates and the number of rows in the polygons slot are the same. Exercise 5.6 Change the row names of geostates to be the FIPS variable, and verify that what you’ve done has worked, both for the main data slot and the polygons slot. Note: if you try to do this before you have deleted the missingdata rows, you will run into problems. In this case, one possibility to define your own missing data code, say "XX" and use it instead of the existing <NA> . If you try this, remember that FIPS is a factor: you will need to convert it to a character variable before you change it. Exercise 5.7 Create a new spatial dataframe MOI containing just the data for Michigan, Ohio and Indiana. Use the plot method to verify that you’ve got both the geographical and the dataslot values. Hints: match will return NAs where the two vectors don’t match. To select from a data frame you need a logical (ie entries are TRUE or FALSE) vector. You can probably figure out how to convert the result of match to the required logical type. Exercise 5.8 When all this has worked, save the new geostates (the version with FIPS row names) as in internal R file for later use.
6 Spatial Neighbors and WeightsThe foundation of spatial econometrics is the representation of spatial adjacency, an indicator of whether one region is or is not a neighbor of another; and spatial weights, which are usually specially processed (“weighted”) neighbors representations. In R, these representations are produced via functions in the spdep package. (Note that this package loads sp, so that sp’s functions are also available to you). It turns out that it is important to check that your geographical data in a shapefile is actually correct: for the US states, there are a number of files — including at least one from a US government source that are not. But we begin with some exercises using a shapefile on the Swiss cantons. The main reason for this is that it will allow us to visualize what we have produced. At the end of this set of exercises we return to the question of a shapefile for the US states. Again, it is useful to record your successful answers to the exercises in a script file for later use. Exercise 6.1 The file CHE_adm.zip contains two sets of spatial data of Switzerland: the CHE_adm0 set are for the whole of the country, while you want the CHEadm1 set, which is for the cantons (states/provinces). So extract the CHE_adm1 components of the shapefile to a hard drive. Read them into R as switz (which will be a SpatialPolygons data frame) using techniques you have already learned. Data source: http://www.divagis.org/gData which also contains politicalboundary shapefiles for other countries. Exercise 6.2 Plot the data. One thing to learn is how to add a title to the plot. It appears that for spatial objects like spatial polygons, the option main for the plot command does not work. You will need to add a title as a separate (lowlevel) plotting commend, using title. Produce a plot of the Swiss data, with title ”The Cantons of Switzerland” Exercise 6.3 Cleanup: we shall make a few changes in the data:
Exercise 6.4 Change the row names of the switz data frame to be the cantonal abbreviations. In the special case where the data frame is a spatial data frame, there is a more convenient way to do this than by using row.names. This is the spChFIDs function. Use it to reset the row names, and note that this function needs an assignment: that is, you want xx<spChFIDs(…) or else you won’t create a new object. Exercise 6.5 The function poly2nb creates spatial neighbors objects from spatial polygons. (Note that this is not a matrix of neighbors, but contains the information necessary to create a matrix, see next exercise). Create a neighbors object for queenstyle neighbors and another for rookstyle neighbors. Plot them, with appropriate titles. Do they look the same? The function diffnb may help you check whether your observation was correct. Exercise 6.6 You may also want to see a spatialneighbors matrix as opposed to a neighbors object, which is hard to export. Matrices are created by the function nb2mat. Create matrices for your two spatial neighbors (ie rook and queen neighbors) objects. Note that you will need to add a style argument if you want a 0/1 matrix, since the default form is rowstandardized (where all the entries in each row are normalized to sum to 1). If you print the resulting matrices, you will see that they get the row names right, but have numeric indicators of the columns. Convince yourself that setting names(mx)<cantons will not work here. You will need to set both items of the matrix’s dimnames attribute. Use write.csv to export the matrix you have just created as a csv file Exercise 6.7 Spatial distances. Look up the documentation on spDists and use it to create a matrix dswitz of distances between (the centroids of) the cantons. You will need to supply the matrix of spatial coordinates in the form coordinates(switz). Note that this matrix has numeric row and column identifiers: use dimnames to set them to the cantonal abbreviations. The usual way that distances are used in spatial analysis is a inverse distances, ie 1/distance. Create invdswitz as a matrix of inverse distances from dswitz. Note that this works even on elements of dswitz (the diagonal elements) that are zero; but that the diagonal elements of invdswitz are “infinity”, represented by Inf. Use diag to change them to 0. Spatialdistance neighbors. One way to create a spatial distance matrix that has nonzero entries for just the spatial neighbors is to multiply the distance matrix by the 0/1 indicator of spatial neighbors. Create a matrix of inverse distances for the neighbors in the queen scheme. Exercise 6.8 Nearest neighbors. Use rowSums on the binary (0/1) queenneighbors matrix to verify that the various cantons have different number of neighbors. In some applications it is useful to guarantee that each region has exactly k neighbors: the neighbors under this scheme are called knearestneighbors. Look at the documentation for the function knearneigh and create an object recording the single nearest neighbor for each of the Swiss cantons. Exercise 6.9 It is useful to visualize these neighbors. To do so:
Plot the single nearest neighbors of the Swiss cantons. It may be useful to develop a function to do this automatically: the function should take as an argument a nb object, and perhaps a title string. It should then run through the steps given above. Write a function to plot the neighbors, and ensure it gives you the right results. Exercise 6.10 Create and plot the set of two nearest neighbors of each of the Swiss cantons. We return now the USstates data. You have previously worked with the shapefile in s_01de10.zip . Restart R, load in the data from the last section of these exercises. Note that packages are not saved with a data save (even in internal format), so import the spdep and rgdal packages. Exercise 6.11 Create a neighbors list and then a binary adjacency matrix from this data. You may need to read up on the zero.policy argument for the nb2mat function. Warning: the neighbors list can easily take several (4+) minutes. Exercise 6.12 The binary adjacency matrix has, for each state or territory (row) a 1 if a particular state is adjacent to it, and zero otherwise. One obvious thing to check is whether the number of neighbors for each state is correct, and in particular, whether states listed as having zero neighbors are correct. Produce a list of the total number of neighbors for each state/territory, and match it up with the names of the geographical units (hint: cbind). Look at the entries with zero neighbors and conclude that there must be something wrong with the shapefile. As it turns out, this shapefile was downloaded from the US Weather Service site (http://www.weather.gov/geodata/catalog/national/html/usstate.htm), which should tell you that even seemingly authoritative data from the US government can be wrong. (The US data from the site hosting the Swiss data has a similar problem, as you can verify if you want to. Of course it is not clear that the sources of the two datasets are independent). You should always check your shapefile before doing any further work with it. Exercise 6.13 Here is another shapefile for the US states tl_2010_us_state10.zip , this one, downloaded from http://www.census.gov/cgibin/geo/shapefiles2010/layers.cgi. Restart R and go through the same steps you did with the last one, and convince yourself that this one is a correct representation of the adjacencies for US states. At the end of this, you should save the at least the neighbors object, since it takes so long to produce.
7 Merging Spatial DataThis exercise set asks you to merge data from the states shapefile (read this in as geo, say) and the 1999 crimes data (crimes), using spCbind from the maptools package. You are strongly advised to this exercise by creating the necessary commands in a script file, and then executing them (rather than just typing commands into the console). That way you will have a record of what you did right when the exercise is done; and that will be useful for later work, since the commands needed for any merge like this are essentially the same. In addition, it is easier to correct mistakes if the commands are in a script file. When doing a complicated exercise like this — or any complicated piece of real datamunging — it is a good idea to check at each stage that the command did what you think it should have done. For example, if you convert a factor variable f1 to numeric — as you may want to do — a quick check of f1 against the variable you’ve just created will show you that this isn’t what you expected. You want to learn this as soon as possible, rather than wait until other commands depending on that one fail. Note: though it is often convenient to keep your spatial and nonspatial data in one data frame, it is not necessary (and in some cases, eg when you have a paneldata setup, isn’t useful, since the spatial paneldata functions do not work with spatial dataframes). All you really need is for the observation order in the spatial and the nonspatial parts to match up. Begin by reading in the spatial data (including the neighbors object) you created at the end of the last set of exercises, and then the crimes data from section 3. Exercise 7.1 Use intersect on the (column) names of the dataframes to see if they have variables (columns) with the same name. If the result is other than character[0] you will need to modify some of the names to make them unique. You could exploit R’s casesensitivity here. Exercise 7.2 In order for spCbind to work, the two data frames must have the same row.names. We will use FIPS codes for this. Extract the fips codes into (say) geo.fips and crimes.fips Exercise 7.3 Problem: singledigit FIPS codes look different in the two variables. ("2" vs "02" for example). Since the fips variables are factors, a solution is (a) convert them to character (b) convert to numeric and (c) convert back to character. Do this and verify that it has solved your problem. Exercise 7.4 Create a variable, say, common, showing the FIPS codes the two data frames have in common. (Hint: see Exercise 7). Exercise 7.5 Remove any NA’s from common Exercise 7.6 Use match to find the positions of the elements of geo.fips in common and create a new data frame from geo based only on those positions. Exercise 7.7 Do the same for crimes. Exercise 7.8 Verify that this has worked: one way is to print out the state names from the two new data frames: they should be the same. Exercise 7.9 Reset the row names of the new data frames to be the common FIPS variable. Exercise 7.10 Merge the two new data frames using spCbind. Check a couple of rows against the original raw data to make sure that everything has worked. This is the part of the exercise that may not be necessary in every case, since, as noted above, all that really matters is for the geographical and nonspatial data to be in the same (here, FIPS) order. Exercise 7.11 You will also need to adjust the spatial neighbors object you created in the last set of exercises. You cannot do this via indexing, since the result is not a neighbors object. But you can do it using the function subset.nb. If you need to refer to the “region IDs” in the neighbors object you can do so via attrib(nb,"region.id"). So now adjust the neighbors object so that it has the same rows as the merged data frame Exercise 7.12 Save the merged data frame and its neighbors object.
8 Spatial AutocorrelationThe standard measure of spatial autocorrelation is Moran’s I statistic which is described in Bivand et al, Chapter 9.4. It is important to note that any conclusion for spatial autocorrelation is relative to a particular neighbors categorization, so that one set of spatial neighbors (queen, say) may yield a different conclusion for autocorrelation than another (rook, or one of the knearest neighbors arrangements). It is also important to understand that the Moran function in the spdep package will not work with “islands”, that is, regions that have no neighbors. One way to guarantee that is to use a knearest neighbors criterion for spatial contiguity; but if you don’t want to do that, you may need to pare down the dataset. Exercise 8.1 Read in the merged spatialpluscrimes data you created in the last set of exercises, and (if it is not saved in the same file) the neighbors object. Load in the spdep library. Exercise 8.2 Create a knearest neighbors from the coordinates of the merged data frame, for k = 1,2,3. Exercise 8.3 Run a Moran test for spatial autocorrelation one some of the crimescount variables (mur, vio) etc and for each of the nearestneighbors objects. Looking at the result, you need to figure out: what is the null hypothesis of the Moran test? So what do large values of the test statistic (low pvalues) tell you? Do the results agree, for all variants of the nearestneighbors matrices? Exercise 8.4 As noted above, Moran tests require a neighbors object with no islands. Try a test using the queenneighbors you have already created and read in. Note the error message, for further reference. Type the name of your queencontiguity matrix at the command prompt, and note that it will tell you if there are any islands in the data, and also their region id. Exercise 8.5 In order to do a Moran test with the queencontiguity matrix, you will need to pare down both the dataset and the neighbors matrix. Note that the island entries are for Alaska and Hawaii, as you would expect. Create a new spatial dataframe excluding these two observations (you can do this by indexing the original dataframe) and a neighbors object that does the same thing (using subset.nb). Check that the new neighbors object really is a neighbors object, and that it has no islands. Exercise 8.6 Repeat the Moran test for some of the crimecount variables, and the (pareddown) data. Try it with the logarithms of those variables, too. Exercise 8.7 Create and save (eg, model<lm(. . .) ) a linear (nonspatial) regression with the logarithm of one of the crimescount variables as the dependent variable, and include the v_shall variable, plus any others you think are relevant. Extract the residuals from this regression, and do a Moran test on these.
9 Spatial RegressionFor a single spatial crosssection, spdep provides you with functions lagsar to estimate the spatial autoregressive (SAR) model and errorsarlm to estimate the spatial error (SEM) model: Exercise 9.1 Load in the version of the crimes data that excludes Alaska and Hawaii (ie the “island” observations), and its associated neighbors object(s). Exercise 9.2 Run an SAR model on your data, to see if the conclusion for the v_shall variable is affected by a spatial formulation. Exercise 9.3 Do the same with the SEM model. Exercise 9.4 One question that has never been completely answered in the litarature is: which model (SAR or SEM) to ause in applied work? Here is a link to a paper by J. Paul Elhorst with a suggestion for how to determine this. How much of the strategy outlined here can be achieved with spdep functions?
10 Spatial Panel Data ModelsPackage splm provides estimators for a variety of spatial paneldata models, that is, where we have several spatial crosssections of data. The following exercises provide an introduction to these models. Here is a link to a paper intended to be published in the Journal of Statistical Software describing the package, which in my view is clearer than reading the help files for the package (and here is a cropped version suitable for 2up printing). This package in effect adds spatial panels to the normal paneldata estimation routines in package plm, and loads plm automatically. Important: For these exercises you are required to construct and save a script recording your solutions to the exercises, and you will be asked to submit that script as part of your final work product for the independent study (see next section). The dataset crimes_ts.zip contains a panel of crimes data for US states, over the period 19701999. The format is essentially that of the 1999 data you have already seen. Read it in. Note that you do not want to merge it with geographical data (eg from a shapefile) since the paneldata estimators cannot copy with Spatial*dataframes. Exercise 10.1 Remove all observations for the island (noncontiguous) states (ie remove all observations on Alaska and Hawaii). You should be left with 49 “regions” — the 48 contiguous states plus the District of Columbia. Exercise 10.2 For spatial panel data estimation, the data must be organized in a series of spatial cross sections: a crosssection of the 49 regions (here, states) in 1970, followed by a crosssection of the 49 regions in 1971, etc. However, the data as read in are organized by state: first the Alabama observations, then the Arizona observations etc. (Note that some state names are empty, but the FIPS listing is complete). Rearrange the data into the format needed for a spatial panel. (Hint: use the order function). Exercise 10.3 If you are going to do lots of paneldata work, it may be more convenient to create a paneldataaware form of the data. The function plm.data allows you to do this. You need to supply an argument index=c("a","b")whose arguments (a and b) are the names of variables naming the individuals (here, the states) and the time periods (in that order). Create a paneldataaware form of the data, called, say pdata. You will not want to use the state variable to name the individuals, since it contains blanks (missing data). Note that you do not have to do this: you can just use the original form of the data, but then your commands to the paneldata estimation routines will need to include the index= argument directly. Exercise 10.4 Read in a neighbors object corresponding to the 49states data. Exercise 10.5 Use the plm function to estimate a nonspatial fixedeffects paneldata model for the logarithm of one of the crime variables. Include v_shall among the independent variables. Save the summary results and a listing of the fixed effects to a text file. Hints: look at the help for R’s sink function ; you can extract the effects by eff<effects(mod) where mod is the object to which you assign the output of the regression model. Since you will be using this specification for much of this set of exercises, you may want to learn how to save the specification as a variable that can be reused: look up the help for as.formula. Exercise 10.6 Estimate a SAR model with spatial fixedeffects using the same specification you used in the previous exercise. Save the results (including the listing of the fixed effects) to a second text file. Hints: if you get an error involving “object ’id’ not found, make sure that you provide the neighbors object in the form listw=nb2listw(nb), rather than just nb2list2(nb). Exercise 10.7 Repeat with both spatial and nonspatial time fixed effects. Exercise 10.8 Estimate an SEM model with (a) spatial fixedeffects; (b) time effects and (c) both. How do these compare to the corresponding nonspatial paneldata models? Exercise 10.9 Estimate a model with spatial and temporal fixed effects, but this time using shalll instead of v_shall. How many time effects have you estimated? What’s going on here? The problem here is that shalll has missing data, and the splm routines handle missing data by removing the spatial crosssections corresponding to where there are missing temporal data. A recent paper by Pfaffermeyr discusses how one can estimate spatial paneldata models with missing data (ie an unvbalanced panel)while allowing you to retain as much of the data as you can; but this has not been implemented in splm.
11 DeliverablesFor your final grade in the independent study please send me a zip file containing:
