China Urbanisation & Large cities

By Datapleth.io | October 19, 2015

In this article we are going to plot a map of China urbanization rate per provinces together with Chinese cities with at least 2 millions population.

In a nutshell, we’ll get first rural and urban population data from official China statistic bureau, then clean the data, we’ll repeat the same two steps for Chinese largest cities. Secondly, we’ll prepare a map of China with provinces. Then we will add the main Chinese cities and their population and a choropleth of urbanization rate, add main cities

library(reshape2)
library(ggplot2)
library(maptools)
library(maps)
library(knitr)
library(kableExtra)
library(dplyr)

Rural and urban population of China

The best source we found is the official China statistic bureau. You can follow the following link to download the data : http://data.stats.gov.cn/english/

We will need three files, Rural, Urban and Total resident per provinces.

Population data for 2000 and 2001 are estimated on the basis of population census, the rest of the data are estimated on the basis of the annual national sample surveys on population changes. Population data by region are permanent resident population since 2005.

## data.stats.gov.cn files tend to be formatted with similar patterns
loadPolupationData <- function(dataFile,mapVariable = "population"){
        dF <- read.csv(dataFile ,header = TRUE, stringsAsFactors = FALSE, skip = 3)
        ## Remove NA values
        dF <- dF[complete.cases(dF),]
        ## Melt as long format
        dF <- melt(dF, id.vars = "Region")
        ## get the proper year format
        dF$variable <- as.numeric(gsub("^X", "", dF$variable))
        ## get the proper province names
        dF$Region <- gsub("^Inner Mongolia$", "Nei Menggu", dF$Region)
        dF$Region <- gsub("^Tibet$", "Xizang", dF$Region)
        ## Get the population in million of people
        dF$value <- (dF$value * 10000)/1000000
        names(dF) <- c("Region", "Year", mapVariable)
        dF }

provincePopulationT <- loadPolupationData(
        dataFile = "https://data.datapleth.io/ext/province-population/China-Total-Resident-per-province.csv",
        mapVariable = "Total.Population"
        )
provincePopulationU <- loadPolupationData(
        dataFile = "https://data.datapleth.io/ext/province-population/China-Urban-Resident-per-province.csv",
        mapVariable = "Urban.Population"
        )
provincePopulationR <- loadPolupationData(
        dataFile = "https://data.datapleth.io/ext/province-population/China-Rural-Resident-per-province.csv",
        mapVariable = "Rural.Population"
        )

## merge in one data frame
provincePopulation <- merge(provincePopulationT, provincePopulationU)
provincePopulation <- merge(provincePopulation, provincePopulationR)
rm(provincePopulationR, provincePopulationT, provincePopulationU)

## Add index
provincePopulation$Urbanisation.rate <- provincePopulation$Urban.Population / provincePopulation$Total.Population
## Melt as long format
provincePopulation <- melt(provincePopulation, id.vars = c("Region", "Year"))

We use a generic function to load and clean the data and we plot an overview bellow in a quick bar plot :

## select urbanisation rate as variable
dF <- provincePopulation[provincePopulation$variable =="Urbanisation.rate",]
## fix the order using 2014 reference by urbanisation rate decreasing
ordered.label <- dF[order(dF$variable, dF$value) , ]
ordered.label <- ordered.label[ordered.label$Year == "2014",]$Region

g <- ggplot(dF, aes(x=Region, y=value)) +
        geom_bar(stat = "identity", fill="chartreuse4") +
        coord_flip() + facet_grid(facets = . ~ Year) + 
        scale_x_discrete( limits=ordered.label ) +
        theme(axis.text.x = element_text(angle = 90))
print(g)

Map of China with provinces, municipalities and autonomous regions

We need now to load the polygon shape data frame for china level one subdivisions, merge it with urbanization rate data.

ChinaPolygonsLevel1 <- maptools::readShapeSpatial(fn = "./data/CHN_adm1.shp")

## Fix English names, simplify
ChinaPolygonsLevel1@data$NAME_1 <- as.character(ChinaPolygonsLevel1@data$NAME_1)
ChinaPolygonsLevel1@data[grep("Xinjiang Uygur", ChinaPolygonsLevel1@data$NAME_1),]$NAME_1 <- "Xinjiang"
ChinaPolygonsLevel1@data[grep("Nei Mongol", ChinaPolygonsLevel1@data$NAME_1),]$NAME_1 <- "Nei Menggu"
ChinaPolygonsLevel1@data[grep("Ningxia Hui", ChinaPolygonsLevel1@data$NAME_1),]$NAME_1 <- "Ningxia"
ChinaPolygonsLevel1@data$NAME_1 <- as.factor(ChinaPolygonsLevel1@data$NAME_1)

# Use level1 as index & Province name as id
ChinaLevel1Data <- ChinaPolygonsLevel1@data
ChinaLevel1Data$id <- ChinaLevel1Data$NAME_1

# Fortify the data (polygon map as dataframe) using english names
ChinaLevel1dF <- fortify(ChinaPolygonsLevel1, region = "NAME_1")

## Merge polygons and associated data in one data frame by id (name of the province in chinese)
ChinaLevel1 <- merge(ChinaLevel1dF, ChinaLevel1Data, by = "id")
rm(ChinaPolygonsLevel1, ChinaLevel1dF, ChinaLevel1Data)

## Create the ggplot using standard approach
## group is necessary to draw in correct order, try without to understand the problem
g <- ggplot(ChinaLevel1, aes(x = long, y = lat, fill = ENGTYPE_1, group = group))
## projected shadow
g <- g+ geom_polygon(aes(x = long + 0.7, y = lat - 0.5), color = "grey50", size = 0.2, fill = "grey50")
## Province boundaries
g <- g + geom_polygon(color = "white", size = 0.2)
## to keep correct ratio in the projection
g <- g + coord_map()
g <- g + labs(title = "China - level 1 subdivisions")
print(g)

Main Chinese Cities data

We want to add on the map the main Chinese cities with their population. There are multiple sources for such data but most of them requires intensive data processing. This will be the topic of another article. For this post, let’s use available data in R and the world.cities data set of the packages maps.

data("world.cities")
## Get cities of China
mainCitiesOfChina <- world.cities[world.cities$country.etc == "China",]
## convert population in millions
mainCitiesOfChina$pop <- mainCitiesOfChina$pop / 1000000

Let’s have a look on the data.

# we use kable to generate an html table with nice formatting
knitr::kable(head(mainCitiesOfChina[mainCitiesOfChina$pop > 2,],4)) %>%
  kable_styling(
    bootstrap_options = c(
      "striped"
      , "hover"
      , "condensed"
      , "responsive"
      )
    ) %>% scroll_box(width = "100%")

	name	country.etc	pop	lat	long	capital
3826	Beijing	China	7.602069	39.93	116.40	1
7108	Changchun	China	2.602574	43.87	125.35	3
7118	Changsha	China	2.131620	28.20	112.97	3
7271	Chengdu	China	3.961709	30.67	104.07	3

Plot China map

We will proceed in two steps, we will make first a choropleth by province of urbanization rate and then we will add cities.

Provinces, urbanisation rate choropleth

We should map the urbanization rate from the provincePopulation dataframe with province of the maps defined by ChinaLevel1. Let’s first check that all provinces are covered and names are matching.

a <- unique(provincePopulation[provincePopulation$variable == "Urbanisation.rate",]$Region)
b <- unique(ChinaLevel1$id)
all(a %in% b)

## [1] TRUE

This looks good, we can proceed and plot the map.

## get large version of the urbanisation dataset
provincesUrbanisation <- dcast(provincePopulation, Year + Region ~ variable, value.var = "value")
## Merge polygons and associated urbanisation data in one data frame by id (name of the province in chinese) and Region - Only for year 2014
urbanisationChoropleth <- merge(ChinaLevel1, provincesUrbanisation[provincesUrbanisation$Year == "2014",], by.x = "id", by.y = "Region")
rm(ChinaLevel1)

## Create the ggplot using standard approach
## group is necessary to draw in correct order, try without to understand the problem
g <- ggplot(
        urbanisationChoropleth,
        aes(x = long, y = lat, fill = Urbanisation.rate, group = group)
        )
g <- g + scale_fill_continuous(na.value = "grey80", low = "#ffdddd", high = "#ff3311", name = "Urbanization %")
## projected shadow
g <- g + geom_polygon(aes(x = long + 0.7, y = lat - 0.5), color = "grey50", size = 0.2, fill = "grey50")
## Province boundaries
g <- g + geom_polygon(color = "white", size = 0.2)
## to keep correct ratio in the projection
g <- g + coord_map()
g <- g + labs(title = "China - Urbanisation rate")
print(g)

Adding Largest cities

Then we just need to add another layer with bubbles for each cities larger than 2 millions inhabitants.

## Theme configuration, as simple a possible
r <- g + theme_bw() + theme(axis.text = element_blank(),
              axis.title = element_blank(),
              axis.ticks = element_blank(),
              legend.text = element_text(size = rel(0.7))
              )
## Bubles for cities, we superpose two type of geom_point, circles an disks with alpha
r <- r + geom_point(
        data = mainCitiesOfChina[mainCitiesOfChina$pop > 2,],
        aes(x = long, y = lat, size = pop, fill = NULL, group = NULL ),
        colour = "blue", alpha = 1, pch = 1
        )
r <- r + geom_point(
        data = mainCitiesOfChina[mainCitiesOfChina$pop > 2,],
        aes(x = long, y = lat, size = pop, fill = NULL, group = NULL ),
        colour = "blue", alpha = 0.2, pch = 16
        )
r <- r + scale_size_area(max_size = 12, breaks = c(2,5,10), name = "Population in millions")
## Add city names, hjust is a small adjustment for better readability
r <- r + geom_text(
        data = mainCitiesOfChina[mainCitiesOfChina$pop > 2,],
        aes(x = long, y = lat, label = name, size = 0.5, group = NULL, fill = NULL, hjust = -0.18),
        show.legend = FALSE
        )
r <- r + guides(colour = guide_legend(nrow = 2), size =  guide_legend(order = 1, nrow = 1))
r <- r + labs(title = "China urbanization rate and cities above 2 millions pop.")
print(r)

Code information

Source code

The source code of this post is available on github

Session information

sessionInfo()

## R version 3.6.1 (2017-01-27)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.6 LTS
## 
## Matrix products: default
## BLAS:   /home/travis/R-bin/lib/R/lib/libRblas.so
## LAPACK: /home/travis/R-bin/lib/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] dplyr_0.8.3      kableExtra_1.1.0 knitr_1.26       maps_3.3.0      
## [5] maptools_0.9-9   sp_1.3-2         ggplot2_3.2.1    reshape2_1.4.3  
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_0.2.5  xfun_0.11         purrr_0.3.3       lattice_0.20-38  
##  [5] colorspace_1.4-1  vctrs_0.2.0       htmltools_0.4.0   viridisLite_0.3.0
##  [9] yaml_2.2.0        rlang_0.4.2       pillar_1.4.2      foreign_0.8-72   
## [13] glue_1.3.1        withr_2.1.2       lifecycle_0.1.0   plyr_1.8.5       
## [17] stringr_1.4.0     rgeos_0.5-2       munsell_0.5.0     blogdown_0.17.1  
## [21] gtable_0.3.0      rvest_0.3.5       mapproj_1.2.6     codetools_0.2-16 
## [25] evaluate_0.14     labeling_0.3      highr_0.8         Rcpp_1.0.3       
## [29] readr_1.3.1       scales_1.1.0      backports_1.1.5   webshot_0.5.2    
## [33] farver_2.0.1      hms_0.5.2         digest_0.6.23     stringi_1.4.3    
## [37] bookdown_0.16     grid_3.6.1        tools_3.6.1       magrittr_1.5     
## [41] lazyeval_0.2.2    tibble_2.1.3      crayon_1.3.4      pkgconfig_2.0.3  
## [45] zeallot_0.1.0     xml2_1.2.2        assertthat_0.2.1  rmarkdown_1.18   
## [49] httr_1.4.1        rstudioapi_0.10   R6_2.4.1          compiler_3.6.1