Global Data Lab

Database Developing World


Structure of the Database

The DDW is not a database in the classic sense. It is a data infrastructure, consisting of many separately archived datafiles. The picture below illustrates how this data infrastructure is built and organized.

The data in the DDW are derived from household surveys held in developing countries. The original datasets of these surveys, as we receive them from the providers, are marked A in the figure. Datasets from the same provider often resemble each other, although there are generally small differences due to changed definitions or country-specific questions. Datasets from different providers differ in many respects, such as file structure, variables included, coding schemes, variable names and labels. To make comparative analyses possible, these "original datasets" have to be harmonized. This is done in step B.

To harmonize the data, for each new dataset a program is written that:

  • Makes a basic set of variables comparable across all surveys (e.g. education is recoded into the same six categories)
  • Creates a number of new variables (e.g. for each child we create a variable indicating how many brothers and sisters (s)he has)
  • Aggregates a basic set of variables to the district and national level and includes them in the context database D (e.g. the percentage of households with a car or TV, which indicates the level of development)
  • Keeps all remaining variables in the dataset, so that they are available for future research.

The outcome of step B is a standardized dataset (C) that is stored in our archive. This archive currently contains hundreds of standardized datasets.
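The harmonization step described above can be illustrated with a minimal Python sketch. It is not the actual DDW program. All field names (`household_id`, `education_raw`, `is_child`, `has_car`, `district`), the six category labels and the sample records are hypothetical, and counting the other children in the same household is only a rough stand-in for the number of brothers and sisters.

```python
from collections import defaultdict

# Hypothetical mapping from one survey's education codes to six shared categories.
EDUCATION_RECODE = {
    "none": 0, "some primary": 1, "primary": 2,
    "some secondary": 3, "secondary": 4, "higher": 5,
}


def harmonize(records):
    """Return (standardized person records, district-level context rows).

    Assumes household_id is unique within the dataset.
    """
    # Count the children in each household and remember each household's
    # district and car ownership (one entry per household, not per person).
    children_per_household = defaultdict(int)
    households = {}
    for rec in records:
        if rec["is_child"]:
            children_per_household[rec["household_id"]] += 1
        households[rec["household_id"]] = (rec["district"], rec["has_car"])

    standardized = []
    for rec in records:
        out = dict(rec)  # keep all remaining variables for future research
        # Comparable variable: unknown codes become None rather than a guess.
        out["education6"] = EDUCATION_RECODE.get(rec["education_raw"])
        # New variable: other children in the same household (simplification).
        if rec["is_child"]:
            out["n_siblings"] = children_per_household[rec["household_id"]] - 1
        else:
            out["n_siblings"] = None
        standardized.append(out)

    # Aggregate to the district level: percentage of households with a car.
    totals = defaultdict(lambda: [0, 0])  # district -> [households, with car]
    for district, has_car in households.values():
        totals[district][0] += 1
        totals[district][1] += int(has_car)
    context = {
        district: {"pct_households_car": 100.0 * with_car / n}
        for district, (n, with_car) in totals.items()
    }
    return standardized, context


# Made-up records, for illustration only.
sample_records = [
    {"household_id": 1, "district": "north", "is_child": True,
     "education_raw": "primary", "has_car": True, "survey_specific": "x"},
    {"household_id": 1, "district": "north", "is_child": True,
     "education_raw": "none", "has_car": True, "survey_specific": "y"},
    {"household_id": 2, "district": "north", "is_child": False,
     "education_raw": "higher", "has_car": False, "survey_specific": "z"},
]
standardized, context = harmonize(sample_records)
```

In this sketch `standardized` plays the role of the standardized dataset (C), and `context` holds the rows that would be added to the context database (D).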

To run an analysis for a specific set of countries, a program (E) is written that accesses the archived datasets of the selected countries, picks out the variables needed for the analysis and brings them together into a working dataset (F). The program also accesses the context database to add the required context variables to the working dataset. The substantive analysis (G) is then run on this working dataset with statistical software (often a multilevel program).
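As a rough illustration of what a program of type E does, the following Python sketch assembles a working dataset from an in-memory stand-in for the archive and the context database. The data layout, the function name `build_working_dataset`, the country codes `AAA` and `BBB` and all variable names are assumptions made for this example. They do not describe the actual DDW files.

```python
def build_working_dataset(archive, context_db, countries, variables,
                          context_variables):
    """Combine selected variables from several standardized datasets.

    archive:    dict mapping country -> list of standardized records (dicts)
    context_db: dict mapping (country, district) -> dict of context variables
    """
    working = []
    for country in countries:
        for rec in archive[country]:
            # Keep identifiers so that the levels remain visible to a
            # multilevel analysis.
            row = {"country": country, "district": rec["district"]}
            # Pick out only the variables needed for this analysis.
            for var in variables:
                row[var] = rec.get(var)
            # Add the district-level context variables; missing ones are None.
            ctx = context_db.get((country, rec["district"]), {})
            for var in context_variables:
                row[var] = ctx.get(var)
            working.append(row)
    return working


# Made-up archive and context database, for illustration only.
archive = {
    "AAA": [{"district": "north", "education6": 2, "n_siblings": 1, "age": 9}],
    "BBB": [{"district": "east", "education6": 4, "n_siblings": 0, "age": 12}],
}
context_db = {
    ("AAA", "north"): {"pct_households_car": 50.0},
    ("BBB", "east"): {"pct_households_car": 12.5},
}
working = build_working_dataset(
    archive, context_db,
    countries=["AAA", "BBB"],
    variables=["education6", "n_siblings"],
    context_variables=["pct_households_car"],
)
```

Each row of `working` combines individual-level variables with the context variables of the district it belongs to, which is the form a multilevel analysis (G) typically expects.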