Geospatial data are measurements or recordings that are associated with a location on Earth. The most familiar type of location information is geographical coordinates, given in degrees latitude and longitude and defining a point on the Earth's surface. However, geospatial data may also be associated with line transects, areal regions, or continuous spatial fields, depending on the application.
Illustration of latitude and longitude
Given a GPS receiver, a latitude, and a longitude, we can get pretty close to the intended location. Much of the uncertainty in the location will be due to the properties of the GPS itself; however, some will also derive from the fact that any geospatial location is really a model with no exact corresponding point on the Earth's surface. With a lot of fancy tricks, one can get down to centimeter-scale accuracy. The important point to understand is that, just like any scientific measurement, a specification of location is uncertain and depends on a model.
The other type of coordinates commonly encountered is Cartesian coordinates, sometimes also called map coordinates, as they are used when the surface of the Earth is projected onto a flat plane, as in printing a map. We will consider the details of map projections later.
Illustration of Cartesian map coordinates
In the development of geographical information systems, two major ways of storing geospatial data emerged. You will generally find that data come in either raster form or vector form.
Comparison of raster and vector GIS data structures
Raster data represent spatial fields of repeated measurements. These are the types of data that you download from your digital camera, which records the intensity of red, green, and blue light in a regular grid. Similarly, satellite imagery records electromagnetic radiation of various wavelengths reflected off the Earth's surface. Another example would be a regular grid of topographic elevations called a Digital Elevation Model (DEM).
In the computer, a raster image is stored as a sequence of values where each successive value is the next pixel along a row (or column) and rows (columns) are appended sequentially. Most software provides a grid or matrix-like view of the data from which you can extract rows, columns, or tiles of values. When the grid is combined with spatial metadata tying it to a location, one can find the coordinates of each pixel from its row and column offset relative to a reference, typically the location of one of the corner pixels (plus information about any rotation of the grid away from true north). In R, raster data are most commonly manipulated with the raster package.
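A minimal sketch of this idea using the raster package (the grid size, extent, and values here are invented purely for illustration):

library(raster)

# A tiny 3 x 4 raster covering an arbitrary extent
# (x from 0 to 4, y from 0 to 3); values are just the pixel sequence
r <- raster(nrows = 3, ncols = 4, xmn = 0, xmx = 4, ymn = 0, ymx = 3)
values(r) <- 1:ncell(r)

# The grid/matrix-like view of the data
as.matrix(r)

# The coordinates of a pixel are computed from the extent and the
# pixel's row/column offset; they are not stored per pixel
xyFromCell(r, cell = 6)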
Vector data are stored as coordinate pairs or points. In GIS, we usually use 2D points representing a location on the Earth's surface; however, we can generalize to 3D location-elevation or 4D space-time coordinates. A sequence of points forms a linestring, and a linestring enclosing a region forms a polygon. These are the three principal data structures in vector GIS: point, linestring, and polygon. Most software will provide access to a collection of points, linestrings, or polygons as a sequential list where each element can be accessed in turn or by offset. Additionally, it is usually possible to access and manipulate each point individually within a linestring or polygon. In R, manipulating vector data is usually done with the sf package, as in the sketch below.
Example point, linestring and polygon
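A minimal sketch of the three structures with sf (the coordinates are invented for illustration):

library(sf)

# A point is a single coordinate pair
pt <- st_point(c(-122.3, 47.6))

# A linestring is a sequence of points, one coordinate pair per row
ln <- st_linestring(rbind(c(0, 0), c(1, 1), c(2, 1)))

# A polygon is a list of rings; each ring must close on its first point
pg <- st_polygon(list(rbind(c(0, 0), c(1, 0), c(1, 1), c(0, 0))))

# Each coordinate remains individually accessible
st_coordinates(ln)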
The major difference between vector and raster is that with vector data, the coordinates of every single measurement are contained in the dataset, whereas with raster data, only a single coordinate pair (or in many cases one for each corner pixel) is contained in the dataset. I call this difference explicit versus implicit referencing. The raster format omits this extra information to save space and to make lookups much faster than if we had to search for a particular location among all the pixel coordinates. In computerese, searching for vector coordinates grows linearly (\(\mathcal{O}(n)\)) with the size of the dataset (or logarithmically, \(\mathcal{O}(\log n)\), in the case of hierarchical indexes), whereas looking up a pixel in a raster is a constant-time operation (\(\mathcal{O}(1)\)). Note that it is trivially easy to convert a raster dataset to a vector dataset by simply replacing the set of pixels with a set of explicit points, each labeled with the pixel value. This expands the memory requirements for manipulating the data, but is nonetheless sometimes convenient for small to moderately sized rasters.
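Continuing the toy raster from the sketch above, this conversion is a single call in the raster package:

# One explicit (x, y, value) row per pixel; the coordinates
# are now stored for every measurement
pts <- rasterToPoints(r)
head(pts)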
R is a computer language designed to facilitate data analysis and model building. It originated as a free and open-source alternative to S+, a commercial implementation of the S language, which was designed by John Chambers at AT&T Bell Labs. R has proven to be spectacularly successful and is now essentially required knowledge in data-driven research fields. A set of slides giving an overview can be downloaded here (PDF format).
There is a vast array of resources available for learning R, so I will not attempt to teach the language here. Your main waypoint in getting started is the R website, and of course Google is your friend. Also look at this, this, and this; thanks to excellent writing, they are more approachable for beginners than their titles suggest.
As a quick example of what you might do in R, consider:
First load some needed packages.
library(tibble)
library(ggplot2)
Now do some work.
x <- rnorm(20)
y <- 2 * x + 1 + rnorm(20)
xy <- tibble(x, y)
p1 <- ggplot(xy, aes(x = x, y = y)) + geom_point()
print(p1)
Here we have simulated some data and made a plot. Now let's build a model.
m1 <- lm(y ~ x, data = xy)
summary(m1)
#>
#> Call:
#> lm(formula = y ~ x, data = xy)
#>
#> Residuals:
#>     Min      1Q  Median      3Q     Max
#> -1.4622 -0.5534  0.1979  0.3900  1.0296
#>
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)
#> (Intercept)   0.9404     0.1582   5.945 1.26e-05 ***
#> x             1.6191     0.1847   8.764 6.53e-08 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.7039 on 18 degrees of freedom
#> Multiple R-squared: 0.8101, Adjusted R-squared: 0.7996
#> F-statistic: 76.81 on 1 and 18 DF, p-value: 6.529e-08
intercept <- coefficients(m1)[1]
slope <- coefficients(m1)[2]
p1 <- p1 + geom_abline(slope = slope, intercept = intercept, color = "steelblue")
print(p1)
These examples use the ggplot2 and tibble packages, which are part of the tidyverse, a suite of modern tools for R that follow a consistent 'tidy' approach. A complete guide is here and is highly recommended.
To get help in R, try the following:
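For example (these are standard base R helpers; the search terms are arbitrary examples):

?lm                            # help page for a function you know by name
help("lm")                     # the same thing, as a function call
help.search("regression")      # search all installed help pages by topic
apropos("lm")                  # list objects whose names match a pattern
vignette(package = "ggplot2")  # list a package's long-form guides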
Note: some versions of RStudio have a bug that causes help searches to fail. You can also use Google.
You should be working in RStudio. I highly recommend forming two habits whenever you are using RStudio.
First, always create a project. Go to File > New Project… in the menus and create a project. Put all of your code and results in the project directory or in subdirectories under the project directory. I keep my projects on cloud storage like Dropbox or Box. When writing packages, I also sync with GitHub.
Second, do not type code directly into the console. I highly recommend that you always create a file and place your commands in that file. RStudio makes it easy to run those commands singly or all at once. There are three choices in the File > New File menu: R Script, R Notebook, and R Markdown…. I recommend the notebook interface, as it lets you intermix markdown text with R code chunks; however, many first-timers prefer a plain script with embedded comments because it is simpler to start with. Use an R Markdown file to create standalone documents. The RStudio IDE has buttons that allow you to run single lines of code, selected lines of code, or all lines in a file.
The R landscape is quite expansive with little hamlets of documents interspersed across a vast wilderness of code. This guide aims to get you oriented.
The first and most important resource is the R Project website. There you will find information on how to download and install R (you may wish to start with RStudio rather than install R directly), some introductory manuals, a few pointers on finding help, and, critically, a link to the Comprehensive R Archive Network (CRAN), a repository containing more than 10,000 add-on packages for R.
Know that the manuals from the R Project, while comprehensive, are not particularly accessible and, if you do read them, are much, much easier to read in PDF format than in HTML. I will point you to some better sources later in this document. CRAN does have some useful Task Views, which are curated lists of packages according to specific topics.
The next most important resource is RStudio. Most users now work with R in RStudio, and we will focus exclusively on this approach. If you do not have RStudio installed, download it here. It should have everything you need to get going. Their Online Learning section has many good resources for learning R and related tools. For this course, basic R, navigating RStudio, and using R Markdown are the important components. Definitely check out their collection of cheatsheets. RStudio also has a Community Website where you can ask questions and meet other RStudio users.
Hadley Wickham has produced some excellent reference materials covering R programming and Data Science with R. Although these references cover advanced topics, they are quite accessible and well written.
A popular place to ask questions is Stack Overflow. The R community also maintains some email lists. Generally you will want to do a good bit of searching for answers before posting questions on these sites.
For those without any R background, work through the basic Data Camp R tutorial. Another excellent tutorial for first-timers is Try R. You might look at R Examples.
Read up on creating R Markdown documents. I urge you to use R Markdown as your default way to interact with R. This is the gateway to reproducible research.
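To demystify it a little, an R Markdown source file is just plain text; a minimal sketch (the title and output format here are made up) looks like this:

---
title: "A first notebook"
output: html_document
---

Some explanatory text written in *markdown*.

```{r}
# An R code chunk; its output appears in the rendered document
summary(cars)
```

Save it with a .Rmd extension and press Knit in RStudio to render the document.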
Next you might consider working through the first few chapters of the R for Data Science book and, in particular, the chapter on visualization. This may seem a steep hill to climb; however, making plots is really an entertaining way to begin working in R.
Check out the chapters on data structures, subsetting and functions in Advanced R. The presentation is quite accessible to beginners.
Take a look at the chapters on transformations, tibbles and pipes in R for Data Science. These are building blocks for applying high-level functions to tabular data.
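For a taste of what those chapters cover, here is a minimal sketch using dplyr and the built-in mtcars data (the particular summary is arbitrary):

library(dplyr)

# Pipe a table through a filter and then a summary
mtcars %>%
  filter(cyl == 4) %>%
  summarise(mean_mpg = mean(mpg))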
With a few hours of effort, you should be able to begin using R. Remember that it takes years to become an expert, so be patient!