Blogs

Getting and cleaning data, example of Chinese airports - Part 2/5

One of the big problems for anybody interested in China and data science is the availability of data sets. Free resources are limited, and they are often incomplete or inaccurate. Getting data, and especially cleaning it, becomes one of the biggest pains of data science applied to China. The objective of this series of posts is to illustrate the problem and the associated process with a specific example: plotting a map of the airports of mainland China.

Continue reading

Getting and cleaning data, example of Chinese airports - Part 1/5

One of the big problems for anybody interested in China and data science is the availability of data sets. Free resources are limited, and they are often incomplete or inaccurate. Getting data, and especially cleaning it, becomes one of the biggest pains of data science applied to China. The objective of this series of posts is to illustrate the problem and the associated process with a specific example: plotting a map of the airports of mainland China.

Continue reading

China Urbanisation & Large cities

In this article we are going to plot a map of China’s urbanization rate per province, together with the Chinese cities that have a population of at least 2 million. In a nutshell, we’ll first get rural and urban population data from the official Chinese statistics bureau and clean the data, then repeat the same two steps for China’s largest cities. Secondly, we’ll prepare a map of China with its provinces. Then we will add the main Chinese cities with their population and a choropleth of the urbanization rate, add main cities

Continue reading

Map of China mainland provinces, municipalities and autonomous regions (in Chinese)

In this article we are going to plot a simple map of China with different levels of subdivisions, using both the base and ggplot2 plotting systems. In a nutshell, we first have to get shapefiles with different subdivision levels; a bit of data cleaning is then necessary to get proper Chinese names for the provinces. Finally, we will plot a base map of China with its subdivisions and add the subdivision names to the map.

Continue reading

Under the hood - Software configuration of datapleth.io

This is a short article briefly describing the set of tools and software used by datapleth.io for data processing, statistical computing and publishing on this blog. In a nutshell: Linux as the operating system; R for statistical computing, data processing and visualization; RStudio as the integrated development environment; Git for version control; GitHub for sharing the code; Travis for CI/CD (test and deploy); R Markdown for writing articles (weaving analysis text, code and output); Hugo as the static website engine; and the blogdown package to integrate with R. Operating System datapleth.

Continue reading