The visualization of spatial data is one of the most popular applications when using R. This tutorial is an introduction to the visualization of spatial data and creation of geographic maps with R
The focus will be on the following subjects:
- Import and use of shapefiles in R (the shapefile is a file format designed for spatial data).
- Data management and restructuring of data.
- The creation of maps.
- Visualization of data on maps.
- Enrichment of created maps using the spatial information of static maps from OpenStreetMap.de.
This tutorial shows the creation of a map of the different districts of Kassel including the number of households within each district. This type of map is also known as Choropleth map. Going even further, this map will also contain the stations of the Kassel bicycle sharing system “Konrad”.
Import of shapefiles
Information about the map borders are stored in a shapefile, which consists of a minimum of three files:
- A file with the extension “shp” containing the geometric information required for the creation of the map.
- A file with the extension “dbf” containing the attributes of the geometric figures, which in our case will consist of the postal codes and the respective city names.
- An index file with the extension “shx”, which enables us to filter the geometric data according the respective attributes. In our case, we have the postal codes for entire Germany, but we do just need an excerpt with postal codes linked to Kassel.
In R, a shapefile may, amongst other options, be imported with the command readShapeSpatial using the package “maptools”, which will result in an S4 object of the class Large SpatialPolygonsDataFrame. In comparison to list objects, S4 classes provide a formal object structure and enable an object oriented workflow. These objects can be further indexed just like a data frame.
shapes <- readShapeSpatial(„plz/post_pl.shp“)
#subsetting only Kassel shapes
ks_shapes <- shapes[shapes$PLZORT99 == „Kassel“,]
The command plot(ks_shapes) visualizes the outlines in R.
This rudimentary map will now be enriched with further data.
The dataset stems from the website datamarket.com and was publicized by the department of Statistics. The data was given for neighborhoods, while the map is based on postal codes. Therefore, another data set is necessary to assign the appropriate neighborhoods to their appropriate postal codes. Such an assignment can be found on Wikipedia, for example (http://de.wikipedia.org/wiki/Kategorie:Stadtteil_von_Kassel).
Loading the data into R with read.csv2, it will look like this:
districts_ks <- read.csv2(„data/mapping_postalcodes_districts_kassel.csv“,
stringsAsFactors = FALSE)
districts_data <- read.csv2(„data/districts_information.csv“, stringsAsFactors = FALSE)
District Households private.used.cars.per.1000.residents
In order to combine two different data sets with each other, it is necessary to merge the district variables into a unified structure. For example, in one data set the description is “Philippinenhof-Warteberg“, while in the other the description is „Philippinenhof/Warteberg“. Hence, in these small data sets we have to change the names by simple indexing and assigning operations. (For larger datasets, a systematical data cleaning by parameters like regular expressions should be conducted to search for patterns in a text.)
#correction of district names in districts_data
districts_data$District[districts_data$District == „Nord (Holland)“] <- „Nord-Holland“
districts_data$District[districts_data$District == „Philippinenhof/Warteberg“] <- „Philippinenhof-Warteberg“
districts_data$District[districts_data$District == „Süsterfeld/Helleböhn“] <- „Süsterfeld“
districts_data$District[districts_data$District == „Wolfsanger/Hasenhecke“] <- „Wolfsanger-Hasenhecke“
The two datasets are merged.
districts_data <- merge(districts_data, districts_ks, all.x = TRUE)
The dataset looks like:
Usually, duplicate entries for district and postal code will occur. Consequently, the next step will have to be a summary of the data set so that we will get precise singular entries for the variable postal code. In order to summarize the data set, we add up the respective values of the households, and write the corresponding districts into a string. For the aggregation, we use the data management package dplyr, which has its own syntax for data management. An important component is the pipe operator %>%, which originates from the magrittr package, and which enables function calls to be chained together in a row, without manually calling upon the dataset each and every time.
#aggregation by postalcode
summarise(District = paste(unique(District),collapse = “ & „) , Households = sum(Households),
cars = sum(private.used.cars.per.1000.residents))
The resulting dataset looks like:
Since we do not know how the distribution of the population across the districts corresponds to the respective distribution of the postal code areas, this method will naturally lead to a certain imprecision. For example, how many of the 5.040 households in the downtown district belong to the postal code 34121 and how many to 34117?
For the data to be finally visualized on the map we have created, it still needs to be categorized.
Using the cut function, we create a factor that contains the appropriate classes. Next, we assign names to these classes.
classes <- cut(data.agg$Households[position], c(0,6000,10000,15000,20000,25000))
levels(classes) <- c(„under 6000“, „6000 – 10000“, „10000 – 15000“, „15000 – 20000″,“20000 – 25000“)
We create a color vector that contains 5 different shades of green.
colours<- c(„#73AC97“, „#489074“, „#267356“, „#0E563B“, „#003924“)
To display the map, we use the plot function in base R. First, we set global parameters for the output device.
par(mar = c(0,0,1,9),oma = c(3,0,2,1),
bg = „lightgrey“,xpd = TRUE)
Then, we have to define the inner margins of the plot with mar = c(0,0,1,9). We do not want to have an inner margin at the bottom and on the left side, however we would like to have an inner margin of about one row at the top and slightly bigger margin of around nine rows on the right hand side for our legend. We will define the size of outer frame by oma = c(3,0,2,1) we define the size of the outer edges. bg = „lightgrey“ will specify the color of the background for our graphic as light grey. By setting the argument xpd = TRUE, we ensure that the legend is displayed inside the inner frame.
To visualize the map, we will use the plot function, we will then surround the postal code areas with dark gray and pass our color vector as colors sorted by classes (number of households in classes).
plot(ks_shapes,border = „darkgrey“, col = colours[classes])
The function text(coordinates(ks_shapes), labels =ks_shapes$PLZ99_N, col = „white“) adds labels to the areas in the form of postal codes.
A legend can be added by a function called legend().
legend(x = par(„usr“),y = par(„usr“), pch = 15, col = colours, legend = levels(classes),ncol = 1, bty = „n“, title = „Households“,
xjust = -3.4, yjust = -1.8)
With title(„Households in Kassel“), we can add a title of the plot
Integration of information from openstreetmap.de
The information about the coordinates of the stations of bike rental system Konrad come from a osm file which relates to the urban area of Kassel. The file itself is constructed in a xml structure.
To extract the Konrad coordinates from this file, there are two different ways:
osm parsen with the osmar package
The first way is to use the osmar package, which handles openstreetmap data. The commands osmsource_file and get_osm load the data and store the information into an osmar object.
src <- osmsource_file(„C:/Users/msc.EODA/workspace/geodaten_visualisierung/Kassel.osm“)
OBJ <- get_osm(complete_file(), source = src)
However, it should be noted that this is done for all stored information, not just for the desired Konrad stations. Accordingly, the process requires quite a lot of computing power and the resulting object is almost 40 MB large.
If you want to receive the corresponding nodes of the Konrad stations, search with find(OBJ, node(tags(v %agrep% „Konrad“))), the matching node IDs of the Konrad nodes and build an appropriate subset of the osmar object with subset(OBJ, node_ids = konrad_id) that only contains the Konrad node.
Osm parsen with xpath
A more efficient method for our scenario is an xpath query. Xpath is a query language to extract information from xml objects. The xml package provides a functional environment for this use case within R.
With xmlparse (“Kassel.osm”) the document will be read into R, the next step would be to send a query for the latitudes and longitudes and write the results directly into a data.frame.
stations <- data.frame(lat = xpathSApply(osm, path = „//node[tag[@v = ‚Konrad‘]]/@lat“),
lon = xpathSApply(osm, path = „//node[tag[@v = ‚Konrad‘]]/@lon“),
stringsAsFactors = FALSE)
We pass an xpath command as argument to path in order to search which nodes there are with a tag node which contains an attribute v, of which the value is Konrad and obtain an output value of the attribute lat from this node.
We add the Konrad stations as White dots with points(stations$lon, stations$lat, col = adjustcolor(„white“,0.4),pch = 19)
to the already existing plot and adjust the transparency of the dots with adjustcolor(„white“,0.4).
Finally, we have to add a reference.
legend(par(„usr“),par(„usr“), pch = 19, col = adjustcolor(„white“,0.4), legend = „Konrad\nStations“, bty = „n“,
xjust = -5.1, yjust = -4)
Finally, we have to add a reference.
mtext(„Sources:\ndatamarket.com\nhttp://arnulf.us/PLZ\nopenstreetmap.de“, side = 1, line = 1, outer = TRUE, cex = 0.7, col = „darkgrey“)
Use the mtext function to write text into the margins of a plot side = 1
describes the bottom side of the plot, line = 1
means the second line (the function begins counting at 0) outer = TRUE
writes the text to the outer margin, finally cex and col define the font size and font color.
mtext as a function stands for margin text and enables us to write text into the margins of our graphic.
- side = 1 describes the bottom side of the plot.
- line = 1 means the second line (the function begins counting at 0).
- outer = TRUE specifies the writing of the text to occur in the outer margin.
- Finally, cex and col define font size and color.
The code and the corresponding data are available at GitHub.
The visualization of analytical results as well as the creation of high quality and interactive graphics with R is one of the subjects of the training program of the eoda Data Science Trainings.