Getting and cleaning data, example of Chinese airports - Part 2/5

By Datapleth.io | December 30, 2015

One of the big problem for anybody interested in China and data science is the availability of data sets. There are limited free resources available and they are often incomplete or inaccurate. Getting data and especially cleaning data becomes one of the biggest pain of data science applied to China.

The objective of this group of post is to illustrate the problem and associated process on a specific example: plot a map of the airports of mainland China.

In this second part, we will do some exploratory data analysis and data cleaning of the Openflight data set we obtained in part 1

# We will need few standard packages as loaded bellow.
library(maptools)
library(reshape2)
library(grid)
library(dplyr)
library(ggplot2)
library(Amelia)
library(knitr)
library(kableExtra)

We start by loading part 1 objects.

load(file = "./data/chinese-airports-part-1.Rda")

Openflight : exploratory data analysis

str(airportChinaOpenflights)
## 'data.frame':    219 obs. of  7 variables:
##  $ name         : chr  "Capital Intl" "Hongyuan Airfield" "Dongshan" "Binhai" ...
##  $ city         : chr  "Beijing" "Hongyuan" "Hailar" "Tianjin" ...
##  $ IATA.FAA     : chr  "PEK" "" "HLD" "TSN" ...
##  $ ICAO         : chr  "ZBAA" "\\N" "ZBLA" "ZBTJ" ...
##  $ latitude.deg : num  40.1 32.8 49.2 39.1 37.7 ...
##  $ longitude.deg: num  117 103 120 117 113 ...
##  $ elevation.m  : num  35.43 3512.1 662.41 3.05 786.4 ...

Seems that we should have all field we need but some data may be missing, let’s investigate.

Variable name

By the command unique(airportChinaOpenflights$name) we find out that some of the names are containing Railway Stations, Ferry, ports. They are listed bellow with their IATA.FAA and ICAO code. All are missing, except GZS, after checking online, it’s confirmed as a train station and not an airport.

test <- airportChinaOpenflights[
    grepl(
        pattern = "RAILWAY|STATION| PORT|FERRY",
        x = toupper(airportChinaOpenflights$name)
        ),]
knitr::kable(head(dplyr::select(test, city, name, IATA.FAA, ICAO))) %>%
  kable_styling(
    bootstrap_options = c(
      "striped"
      , "hover"
      , "condensed"
      , "responsive"
      )
    ) %>% scroll_box(width = "100%")
city name IATA.FAA ICAO
6971 Guangzhou Guangzhou South Railway Station GZS
7326 Urumqi Urumqi Railway Station
7327 Turpan Turpan Railway Station
7410 Beijing Beijing Railway Station
7411 Chengde Chengde Railway Station
7445 Fuzhou Fuzhou Railway Station
## Inititalise category field, consider all as Airport
airportChinaOpenflights$category <- "civil airport"

Checking wikipedia we find out that IATA also provides codes for railway stations and for airport handling entities.

airportChinaOpenflights <- airportChinaOpenflights[airportChinaOpenflights$category == "civil airport",]

We keep only the 219 airports.

Variable city

sum(is.na(airportChinaOpenflights$city))
## [1] 0
sum(airportChinaOpenflights$city == "")
## [1] 0
airportChinaOpenflights[grepl(pattern = "XI.*AN", toupper(airportChinaOpenflights$city)),]$city <- "Xi'an"

No missing values, no NAs for city but we just need to fix Xi’an city name, which appear in one of the record with for escape char.

Variable IATA.FAA

From wikipedia

An IATA airport code, also known an IATA location identifier, IATA station code or simply a location identifier,[1] is a three-letter code designating many airports around the world, defined by the International Air Transport Association (IATA).

Let’s check the airport which are not matching a 3 letters pattern, what’s their code ? As seen bellow, in fact the data is missing, let’s substitute by NAs.

unique(airportChinaOpenflights[!grepl( pattern = "[A-Z][A-Z][A-Z]",x = airportChinaOpenflights$IATA.FAA),]$IATA.FAA)
## [1] ""
airportChinaOpenflights[!grepl( pattern = "[A-Z][A-Z][A-Z]",x = airportChinaOpenflights$IATA.FAA),]$IATA.FAA <- NA

Is IATA unique ? … seems yes ! No double identification by IATA.FAA

checkIATA <- as.data.frame(table(airportChinaOpenflights[!is.na(airportChinaOpenflights$IATA.FAA),]$IATA.FAA))
names(checkIATA) <- c("IATA.FAA", "freq")
checkIATA[checkIATA$freq > 1,]$IATA.FAA
## factor(0)
## 185 Levels: AAT ACX AEB AKA AKU AQG BAV BFJ BHY BJS BPX BSD CAN CGD CGO ... ZUH

Variable ICAO code

From wikipedia

The ICAO airport code or location indicator is a four-character alphanumeric code designating aerodromes around the world. These codes are defined by the International Civil Aviation Organization,[…]. They differ from IATA codes, which are generally used for airline timetables, reservations, and baggage tags. For example, the IATA code for London’s Heathrow Airport is LHR and its ICAO code is EGLL. ICAO codes are commonly seen by passengers and the general public on flight-tracking services such as FlightAware. In general IATA codes are usually derived from the name of the airport or the city it serves, while ICAO codes are distributed by region and country.

Is ICAO unique ?, seems yes !, no double identification by ICAO

checkICAO <- as.data.frame(
    table(airportChinaOpenflights[!is.na(airportChinaOpenflights$ICAO),]$ICAO)
    )
names(checkICAO) <- c("ICAO", "freq")
checkICAO[checkICAO$freq > 1,]$ICAO
## [1] \\N
## 165 Levels:  \\N CN-0 ZBAA ZBBB ZBCF ZBCZ ZBDS ZBDT ZBHD ZBHH ZBLA ZBOW ... ZYYJ

Variable : latitude.deg, longitude.deg, elevation.m

sum(is.na(airportChinaOpenflights$latitude.deg)) + 
sum(is.na(airportChinaOpenflights$longitude.deg)) +
sum(is.na(airportChinaOpenflights$elevation.m))
## [1] 0

We don’t have missing values in longitude, latitude and elevation. But do we have 0, does the data looks logical ? Let’s build a plot of the distributions of these variables.

##      Create the 2x2 tiled chart to be filled by row
par(mfrow = c(2,2))
hist(airportChinaOpenflights$latitude.deg, breaks = 20)
hist(airportChinaOpenflights$longitude.deg, breaks = 20)
hist(airportChinaOpenflights$elevation.m, breaks = 20)
hist(
    airportChinaOpenflights[airportChinaOpenflights$elevation.m < 30,]$elevation.m,
    breaks = 100, 
    xlab = "Airports with elevation < 30m"
    )

With this tiled chart we can visualize quickly that distribution of latitude and longitude looks reasonable with no extreme values included. However, regarding elevation, we have a lot of airport in low elevation and in the bottom right chart we see that most of them (>60) are at 0 meters or very close to zero. Actually we have 72 airports with an elevation of 0 meters.

g <- ggplot() + theme_bw()
## airports with elevation
g <- g + geom_point(data = airportChinaOpenflights,
                    aes(x = longitude.deg, y = latitude.deg, colour = elevation.m),
                    pch = 19)
## airport with 0 m  as elevation in red
g + geom_point(data = airportChinaOpenflights[airportChinaOpenflights$elevation.m == 0,], 
                    aes(x = longitude.deg, y = latitude.deg, colour = elevation.m),
                    size = 5, alpha = 0.2, col = "firebrick", pch = 1) +
        ggtitle("exploratory plot of OF dataset with elevation")

This map is the plot of airports based on the geolocalisation and elevation. Airports with 0m of elevation are highlighted by a red circle. We can conclude that in fact their elevation is missing and replace with NAs.

airportChinaOpenflights[airportChinaOpenflights$elevation.m == 0,]$elevation.m <- NA

Conclusion

Let’s use the Amelia package to get an overview of the missing data in the data set.

missmap(airportChinaOpenflights, 
        main = "Missingness of Openflight Chinese airports")

We have a dataset of 219 airports which still requires some work to complete the missing data for few IATA codes and several ICAO Codes. Most of the elevation data is missing.

Saving objects for next part

Let’s now save the different objects to be reused in the next part of this post.

save( list = c("airportChinaOpenflights", 
               "airportChinaOurairports"
               ),
      file = "./data/chinese-airports-part-2.Rda")

In the next post, we will do some exploratory data analysis and data cleaning of the OurAirports data set.

Code information

Source code

The source code of this post is available on github

Session information

sessionInfo()
## R version 3.6.1 (2017-01-27)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.6 LTS
## 
## Matrix products: default
## BLAS:   /home/travis/R-bin/lib/R/lib/libRblas.so
## LAPACK: /home/travis/R-bin/lib/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] grid      stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
## [1] kableExtra_1.1.0 knitr_1.26       Amelia_1.7.6     Rcpp_1.0.3      
## [5] ggplot2_3.2.1    dplyr_0.8.3      reshape2_1.4.3   maptools_0.9-9  
## [9] sp_1.3-2        
## 
## loaded via a namespace (and not attached):
##  [1] highr_0.8         compiler_3.6.1    pillar_1.4.2      plyr_1.8.5       
##  [5] tools_3.6.1       zeallot_0.1.0     digest_0.6.23     viridisLite_0.3.0
##  [9] evaluate_0.14     tibble_2.1.3      lifecycle_0.1.0   gtable_0.3.0     
## [13] lattice_0.20-38   pkgconfig_2.0.3   rlang_0.4.2       rstudioapi_0.10  
## [17] yaml_2.2.0        blogdown_0.17.1   xfun_0.11         xml2_1.2.2       
## [21] httr_1.4.1        withr_2.1.2       stringr_1.4.0     vctrs_0.2.0      
## [25] hms_0.5.2         webshot_0.5.2     tidyselect_0.2.5  glue_1.3.1       
## [29] R6_2.4.1          foreign_0.8-72    rmarkdown_1.18    bookdown_0.16    
## [33] farver_2.0.1      readr_1.3.1       purrr_0.3.3       magrittr_1.5     
## [37] backports_1.1.5   scales_1.1.0      htmltools_0.4.0   rvest_0.3.5      
## [41] assertthat_0.2.1  colorspace_1.4-1  labeling_0.3      stringi_1.4.3    
## [45] lazyeval_0.2.2    munsell_0.5.0     crayon_1.3.4