By Datapleth.io | March 9, 2016
Winter 2016 - For the first time ever, the city of Shanghai (and probably of Nanjing) entered the year of the monkey with a very quiet night. For safety and environmental reasons, the usual fireworks and firecrackers hysteria was forbidden downtown by the local government. (read the full story
This is a very big cultural change which was implemented very strictly. Fireworks and other noisy traditional firecrackers are used for important events in China, such as weddings, opening of shops or companies and lunar new year celebrations. They are used to chase away evils and bring good luck and prosperity for the future.
Our objective is to find out if GPD growth is improved by the amount of fireworks used during Chinese new year or if at the contrary, less improvement in economy lead to less frantic nights and less fireworks.
In this post we will try with a bit of data hack and visualization :
- to confirm and illustrate the impact of fireworks on air quality during the new year’s eve
- to quantify the amount of fireworks based on air quality change during that night
- to confirm links between GDP growth (the main economical indicator of China economy) of the country and the different cities with amount of fireworks used during new year’s eve
- Get and clean PM2.5 from main cities (see previous post here)
- Get and clean dates of past Chinese new year eve
- Get and clean GDP growth data for China and its main cities
- Prepare a clean data set with days before and after new year
- Exploratory analysis
- Evaluate impact of fireworks on PM2.5
- Estimate the amount of fireworks used in each cities
- build a model PM2.5 > amount of fireworks > GDP growth
- analysis and conclusions
In this first part we will cover steps 1 to 4.
We need several R packages for this study.
library(lubridate) library(dplyr) library(ggplot2) library(ggthemes) library(reshape2) library(data.table) library(rvest) # for webscrapping library(plotly) library(RColorBrewer) # Set-up locale for date format processing Sys.setlocale("LC_ALL","C")
##  "LC_CTYPE=C;LC_NUMERIC=C;LC_TIME=C;LC_COLLATE=C;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C"
Getting and cleaning data :
We need PM2.5 readings, Chinese new year dates & GPD growth of main cities
Chinese new year dates
url <- "https://en.wikipedia.org/wiki/Chinese_New_Year" table_xpath <- '//*[@id="mw-content-text"]/div/table' cny <- url %>% read_html() %>% html_nodes(xpath= table_xpath) %>% html_table(fill = TRUE) ## keep only what we want cny <- cny[] cny <- cny[-1,1:4] names(cny) <- c("year", "date","animal","weekday" ) cny$date <- paste(cny$date, cny$year) cny$date <- as.character(strptime(cny$date, "%d %b %Y")) head(cny)
## year date animal weekday ## 2 2001 2001-01-24 Snake Wednesday ## 3 2002 2002-02-12 Horse Tuesday ## 4 2003 2003-02-01 Goat Saturday ## 5 2004 2004-01-22 Monkey Thursday ## 6 2005 2005-02-09 Rooster Wednesday ## 7 2006 2006-01-29 Dog Sunday
## extract the New Year eve and night cny$eve <- as.character(as.Date(cny$date)-1)
We will simply reuse directly the dataset generated in(our previous post)
load(file = "./data/aqi-1.Rda") names(aqi)
##  "country" "city" ##  "date.time" "date" ##  "year" "month" ##  "day" "hour" ##  "measurement.available" "pm2.5" ##  "aqi"
# Remove Paris aqi <- aqi[city != "Paris"]
The codebook for this dataset will be soon available. We only keep here data for chinese new year date and eve.
aqicny <- aqi[date %in% cny$date | date %in% cny$eve] aqicny[ , time := hour ] # keep only hours aqicny[ date %in% cny$date, time := time + 24 ] # add 24 for NY day aqicny[ , time := time - 24] # re-center on mid-night aqicny[ , year := as.factor(year) ]
Let’s visualise the evolution of particles during this period. We clearly notice a recurring pattern with peak of pollution around midnight !
g <- ggplot(data = aqicny) + theme_tufte() + theme(legend.position = "none") + geom_point(aes( x = time,y = pm2.5, col = city), size = 0.5) + geom_vline(xintercept = 0, color = "grey") + facet_grid(facets = city ~ year) + ggtitle("Peak of pm2.5 during chinese new year eve (fireworks)") + xlab("Hours from new year's midnight") + theme(axis.text.x = element_text(angle = 90)) + theme(axis.title.x = element_text(margin = margin(t = 20, r = 0, b = 0, l = 0))) + ylab("pm2.5 in micro g/m3") + scale_color_brewer(palette="Dark2") #Dark2 as alternative p <- plotly::ggplotly(g) ## Fix plotly label spacing ## thanks : ## https://stackoverflow.com/questions/42763280/r-ggplot-and-plotly-axis-margin-wont-change p[['x']][['layout']][['annotations']][][['y']] <- -0.06 p[['x']][['layout']][['annotations']][][['x']] <- -0.06 p %>% config(displayModeBar = F)
#%>% layout(margin = list(l = 75))
The best source we found is the official China statistic bureau. You can follow the following this link to download the data (Choose National Account / Gross Regional Product Index) - you will need to register to download the file.
## data.stats.gov.cn files tend to be formatted with similar patterns dataFile <- "https://data.datapleth.io/ext/GDP/GDP_Growth_AnnualbyProvince.csv" dataFile <- "https://datapleth.sfo2.digitaloceanspaces.com/ext/GDP/GDP_Growth_AnnualbyProvince.csv" ## Not very elegant here but fread returns strange characters which breaks Knitr gdpGrowth <- as.data.table( read.csv( dataFile, header = TRUE, stringsAsFactors = FALSE,skip = 3 ) ) ## Remove NA values gdpGrowth <- gdpGrowth[complete.cases(gdpGrowth)] ## Fix dF names setnames(gdpGrowth, c("Region", gsub("^X", "", names(gdpGrowth)[-1]))) ## get the proper province names gdpGrowth[ , Region := gsub("^Inner Mongolia$", "Nei Menggu", Region) ] gdpGrowth[ , Region := gsub("^Tibet$", "Xizang", Region) ] ## Melt as long format gdpGrowth <- melt(gdpGrowth, id.vars = "Region") ## Get the GDP Growth in % gdpGrowth[ , value := value - 100 ] setnames(gdpGrowth, c("Region", "Year", "provincial.gdp.Growth")) ## Get the year in numeric instead of factor gdpGrowth[ , Year := as.numeric(as.character(Year)) ] ## Get the GDP in percentage of growth #gdpGrowth[ , provincial.gdp.Growth := (provincial.gdp.Growth - 100)] ## Let's keep only the province for which we have PM2.5 data listOfProvince <- c("Shanghai", "Beijing", "Sichuan", "Guangdong", "Liaoning") gdpGrowth <- gdpGrowth[ Region %in% listOfProvince ] ## add the province capital gdpGrowth[ , city := Region ] test <- gdpGrowth$Region == "Liaoning" gdpGrowth[test, city := "Shenyang"] test <- gdpGrowth$Region == "Guangdong" gdpGrowth[test, city := "Guangzhou" ] test <- gdpGrowth$Region == "Sichuan" gdpGrowth[test, city := "Chengdu"]
g <- ggplot(data = gdpGrowth, aes( x = Year,y = provincial.gdp.Growth), group = city) + theme_tufte() + theme(legend.position = "none") + geom_segment( aes(x=Year, xend=Year, y=0, yend=provincial.gdp.Growth, group = city), color = "grey") + geom_point(aes(col = city)) + scale_color_brewer(palette="Dark2") + geom_hline(yintercept = 0, color = "grey") + facet_wrap(facets = city ~ ., nrow = 1) + ggtitle("GDP Growth of associated province") + xlab("Year") + theme(axis.text.x = element_text(angle = 90)) + ylab("Provincial GDP growth in %") p <- plotly::ggplotly(g) ## plotly label spacing p[['x']][['layout']][['annotations']][][['y']] <- -0.15 p[['x']][['layout']][['annotations']][][['x']] <- -0.04 p %>% config(displayModeBar = F)
We have now two clean datasets with PM2.5 readings for new year’s eve and new year’s day as well as GDP growth for the provinces concerned.
Let’s save the data for next post
save(gdpGrowth, aqicny, file = "./data/gqp-aqi-part1.Rda")