rdhs client object and internal package design

Overview

The following is a similar vignette - “How to use rdhs?”. However, this one achieves the same results using the rdhs client object directly to download datasets etc, rather than using the easier user interfacing functions. This vignette is included to help demonstrate how the pacakge internals are working and adds more information about how the package was designed, which may be useful for anyone adding extra functionality to rdhs in the future.

If are not bothered by this then please read the main introductory vignette.

rdhs is a package for management and analysis of Demographic and Health Survey (DHS) data. This includes functionality to:

Access standard indicator data (i.e. DHS STATcompiler) in R via the DHS API.
Identify surveys and datasets relevant to a particular analysis.
Download survey datasets from the DHS website.
Load datasets and associated metadata into R.
Extract variables and combining datasets for pooled multi-survey analyses.

This process is described below and should cover most functionality that will be needed for working with these datasets.

0. Installation

Install rdhs from github with devtools:

# install.packages("devtools")
# devtools::install_github("ropensci/rdhs")

library(rdhs)

1. Access standard indicator data via the API

The DHS API has many endpoints that can be accessed using anyone of dhs_<endpoint>() functions. All exported functions within rdhs that start dhs_ interact with a different with a different endpoint of the DHS API. Their website gives great information about the different search terms and filters that can be used, and we have tried to include all of this within the documentation of each function.

One of those endpoints, dhs_data(), interacts with the the published set of standard health indicator data calculated by the DHS. This endpoint allows use to retrieve a set of health indicators that have been sample weighted to give country, subnational estimates that can be further refined by education and wealth brackets. To do this we use the dhs_data() endpoint, which we can then either search for specific indicators, or by querying for indicators that have been tagged within specific areas.

## what are the indicators
indicators <- dhs_indicators()
indicators[1,]

##                                                                                                           Definition
## 1 Age-specific fertility rate for the three years preceding the survey for age group 10-14 expressed per 1,000 women
##   NumberScale IndicatorType MeasurementType IsQuickStat  ShortName
## 1           0             I            Rate           0 ASFR 10-14
##     IndicatorId    Level1 IndicatorTotalId          Level2 Level3
## 1 FE_FRTR_W_A10 Fertility                  Fertility rates  Women
##        SDRID IndicatorOldId TagIds DenominatorWeightedId
## 1 FEFRTRWA10                                            
##                                Label IndicatorOrder
## 1 Age specific fertility rate: 10-14       11763005
##                                                                     Denominator
## 1 Per thousand women years exposed in the period 1-36 months prior to interview
##   QuickStatOrder IndicatorSpecial1Id DenominatorUnweightedId
## 1                                                           
##   IndicatorSpecial2Id
## 1

Each call to a DHS endpoint returns a data.frame by default with all the results available by default.

The DHS has a unique IndicatorId for each of the statistics it calculates. The definition and specific string for each indicator is included within the IndicatorId and Definition variable:

# grab the first 5 alphabetically
indicators[order(indicators$IndicatorId),][1:5,c("IndicatorId", "Definition")]

##        IndicatorId
## 2031 AH_CIGA_M_UNW
## 2026 AH_CIGA_W_10P
## 2023 AH_CIGA_W_12C
## 2024 AH_CIGA_W_35C
## 2025 AH_CIGA_W_69C
##                                                               Definition
## 2031                     Number of men who smoke cigarettes (unweighted)
## 2026 Percentage of women who smoked 10+ cigarettes in preceding 24 hours
## 2023 Percentage of women who smoked 1-2 cigarettes in preceding 24 hours
## 2024 Percentage of women who smoked 3-5 cigarettes in preceding 24 hours
## 2025 Percentage of women who smoked 6-9 cigarettes in preceding 24 hours

Since there are quite a lot of indicators, it might be easier to first query by tags. The DHS tags their indicators by what areas of demography and health they relate to, e.g. anaemia, literacy, malaria parasitaemia are all specific tags. First let’s look at what the tags are, by interacting with the dhs_tags() endpoint, before grabbing data that related to malaria parasitaemia in the DRC and Tanzania since 2010:

# What are the tags
tags <- dhs_tags()

# Let's say we want to view the tags that relate to malaria
tags[grepl("Malaria", tags$TagName), ]

##    TagType                   TagName TagID TagOrder
## 31       0       Malaria Parasitemia    36      540
## 43       2 Select Malaria Indicators    79     1000

# and now let's then grab this data by specifying the countryIds and the survey year starts
data <- dhs_data(tagIds = 36,countryIds = c("CD","TZ"),breakdown="subnational",surveyYearStart = 2010)
data[1,]

##   DataId                           Indicator  SurveyId IsPreferred Value
## 1 941297 Malaria prevalence according to RDT CD2013DHS           1  17.1
##        SDRID Precision        RegionId SurveyYearLabel SurveyType
## 1 MLPMALCRDT         1 CDDHS2013503010         2013-14        DHS
##   SurveyYear IndicatorOrder DHS_CountryCode CILow
## 1       2013      125136010              CD      
##                 CountryName IndicatorType CharacteristicId
## 1 Congo Democratic Republic             I           503010
##   CharacteristicCategory   IndicatorId CharacteristicOrder
## 1                 Region ML_PMAL_C_RDT             1503010
##   CharacteristicLabel ByVariableLabel DenominatorUnweighted
## 1            Kinshasa                                   406
##   DenominatorWeighted CIHigh IsTotal ByVariableId
## 1                 532              0            0

Depending on your analysis this maybe more than enough detail. It is also worth mentioning that this data can also be accessed via DHS STATcompiler if you prefer a click and collect version. However, hopefully one can see that selecting a lot of different indicators for multiple countries and breakdowns should be a lot easier using the rdhs API interaction. For example we can very quickly find out the trends in antimalarial use in Africa, and see if perhaps antimalarial prescription has decreased after RDTs were introduced (assumed 2010).

# Make an api request
resp <- dhs_data(indicatorIds = "ML_FEVT_C_AML", surveyYearStart = 2010,breakdown = "subnational")

# filter it to 12 countries for space
countries  <- c("Angola","Ghana","Kenya","Liberia",
                "Madagascar","Mali","Malawi","Nigeria",
                "Rwanda","Sierra Leone","Senegal","Tanzania")

# and plot the results
library(ggplot2)
ggplot(resp[resp$CountryName %in% countries,],
       aes(x=SurveyYear,y=Value,colour=CountryName)) +
  geom_point() +
  geom_smooth(method = "glm") + 
  theme(axis.text.x = element_text(angle = 90, vjust = .5)) +
  ylab(resp$Indicator[1]) + 
  facet_wrap(~CountryName,ncol = 6)

If we incorrectly entered a filter query (very possible), rdhs will let us know our request was invalid:

# Make an api request
resp <- dhs_data(indicatorIds="ML_FEVT_C_AMasfafasfL",
                 surveyYearStart=202231231306,
                 breakdown="subParTyping")

## Error in handle_api_response(resp, TRUE): 
##    -> DHS API Request Failed [500] 
##    -> Error Type: dhs_internal_server_error

2. Identify surveys relevant for further analysis

You may, however, wish to do more nuanced analysis than the API allows. The following 4 section detail a very basic example of how to quickly identify, download and extract datasets you are interested in.

Let’s say we want to get all the survey data from the Democratic Republic of Congo and Tanzania in the last 5 years (since 2013), which covers the use of rapid diagnostic tests (RDTs) for malaria. To begin we’ll interact with the DHS API to identify our datasets.

To start our extraction we’ll query the surveyCharacteristics endpoint using dhs_surveyCharacteristics():

## make a call with no arguments
sc <- dhs_survey_characteristics()
sc[grepl("Malaria", sc$SurveyCharacteristicName), ]

##    SurveyCharacteristicID SurveyCharacteristicName
## 57                     96            Malaria - DBS
## 58                     90     Malaria - Microscopy
## 59                     89            Malaria - RDT
## 60                     57          Malaria module 
## 61                      8 Malaria/bednet questions

There are 87 different survey characteristics, with one specific survey characteristic for Malaria RDTs. We’ll use this to then find the surveys that include this characteristic. We can also at this point filter for our desired countries and years. The DHS API allows for countries to be filtered using by their countryIds, which is one of the arguments in dhs_surveys(). To have a look at what each countries countryId is we can use another of the API endpoints first:

## what are the countryIds
ids <- dhs_countries(returnFields=c("CountryName", "DHS_CountryCode"))
str(ids)

## 'data.frame':    91 obs. of  2 variables:
##  $ DHS_CountryCode: chr  "AF" "AL" "AO" "AM" ...
##  $ CountryName    : chr  "Afghanistan" "Albania" "Angola" "Armenia" ...

# lets find all the surveys that fit our search criteria
survs <- dhs_surveys(surveyCharacteristicIds = 89,countryIds = c("CD","TZ"),surveyYearStart = 2013)

# and lastly use this to find the datasets we will want to download and let's download the flat files (.dat) datasets (have a look in the dhs_datasets documentation for all argument options, and fileformat abbreviations etc.)
datasets <- dhs_datasets(surveyIds = survs$SurveyId, fileFormat = "flat")
str(datasets)

## 'data.frame':    19 obs. of  13 variables:
##  $ FileFormat          : chr  "Flat ASCII data (.dat)" "Flat ASCII data (.dat)" "Flat ASCII data (.dat)" "Flat ASCII data (.dat)" ...
##  $ FileSize            : int  198561 7030083 3226262 8028957 11426382 4794941 1569680 6595349 63022 996906 ...
##  $ DatasetType         : chr  "HIV Datasets" "Survey Datasets" "Survey Datasets" "Survey Datasets" ...
##  $ SurveyNum           : int  421 421 421 421 421 421 421 421 421 421 ...
##  $ SurveyId            : chr  "CD2013DHS" "CD2013DHS" "CD2013DHS" "CD2013DHS" ...
##  $ FileType            : chr  "HIV Test Results Recode" "Births Recode" "Couples' Recode" "Household Recode" ...
##  $ FileDateLastModified: chr  "November, 14 2014 12:48:34" "November, 17 2014 15:42:54" "November, 17 2014 15:43:04" "September, 19 2016 09:57:20" ...
##  $ SurveyYearLabel     : chr  "2013-14" "2013-14" "2013-14" "2013-14" ...
##  $ SurveyType          : chr  "DHS" "DHS" "DHS" "DHS" ...
##  $ SurveyYear          : int  2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
##  $ DHS_CountryCode     : chr  "CD" "CD" "CD" "CD" ...
##  $ FileName            : chr  "CDAR61FL.ZIP" "CDBR61FL.ZIP" "CDCR61FL.ZIP" "CDHR61FL.ZIP" ...
##  $ CountryName         : chr  "Congo Democratic Republic" "Congo Democratic Republic" "Congo Democratic Republic" "Congo Democratic Republic" ...

Lastly, we recommended to download either the spss (.sav), fileFormat = "SV", or the flat file (.dat), fileFormat = "FL" datasets. The flat is quicker, but there are still one or two datasets that don’t read correctly, whereas the .sav files are slower to read in but so far no datasets have been found that don’t read in correctly.

We can now use this to download our datasets for further analysis.

3. Download survey datasets

We can now go ahead and download our datasets. To do this we need to first create a client. The client is an R6 class (similar to R’s built in reference classes and make caching survey and API queries more reproducible) and will be used to log in to your DHS account, download datasets for you, and help query those datasets for the question you are interested in. The client will also cache all of these processes, which really helps increase the reproducibility of your analysis.

In order to set up our credentials we use the function set_rdhs_config(). This will require providing as arguments your email and project for which you want to download datasets from. You will then be prompted for your password.

You can also specify a directory for datasets and API calls to be cached to using cache_path. If you do not provide an argument for cache_path you will be prompted to provide permission to rdhs to save your datasets and API calls within your user cache directory for your operating system. This is to comply with CRAN’s requests for permission to be granted before writing to system files. If you do not grant permission, these will be written within your R temporary directory (as we saw above when we first used one of the functions to query the API). Similarly if you do not also provide an argument for config_path, this will be saved within your temp directory unless permission is granted. Your config files will always be called “rdhs.json”, so that rdhs can find them easily.

## create a client
## set up your credentials
config <- set_rdhs_config(email = "rdhs.tester@gmail.com",
                project = "Testing Malaria Investigations")

client <- client_dhs(config)
client

## <client_dhs>
##   Public:
##     available_datasets: function (clear_cache_first = FALSE) 
##     clear_namespace: function (namespace) 
##     dhs_api_request: function (api_endpoint, query = list(), api_key = private$api_key, 
##     extract: function (questions, add_geo = FALSE) 
##     get_cache_date: function () 
##     get_config: function () 
##     get_datasets: function (dataset_filenames, download_option = "rds", reformat = FALSE, 
##     get_downloaded_datasets: function () 
##     get_root: function () 
##     get_variable_labels: function (dataset_filenames = NULL, dataset_paths = NULL, rm_na = FALSE) 
##     initialize: function (config, api_key = NULL, root = rappdirs_rdhs()) 
##     save_client: function () 
##     set_cache_date: function (date) 
##     survey_questions: function (dataset_filenames, search_terms = NULL, essential_terms = NULL, 
##     survey_variables: function (dataset_filenames, variables, essential_variables = NULL, 
##   Private:
##     api_endpoints: data indicators countries surveys surveycharacteristics  ...
##     api_key: ********
##     cache_date: 2018-09-24 16:36:45
##     check_available_datasets: function (filenames) 
##     config: rdhs_config
##     na_s: ^na -|^na-|.*-na$|.* - na$| \{NA\}$|.* NA$|.*NA$
##     package_version: package_version, numeric_version
##     root: C:\Users\Oliver\AppData\Local\Oliver\rdhs\Cache
##     storr: storr, R6
##     url: https://api.dhsprogram.com/rest/dhs/
##     user_declared_root: NULL

Before we use our client to download our datasets, it is worth mentioning that the client can be passed as an argument to any of the API functions we have just seen. This will then cache the results for you, so that if you are working remotely or without a good internet connection you can still return your previous API requests. This is what is happening behind the scenes anyway, as rdhs will create a client for you when you call an API function (if one does not already exist).

Now back to our dataset downloads. If we have a look back at our datasets object, we’ll see there are 19 datasets listed. However, not all of them will be relevant to our malaria RDT questions. One approach is to head to the DHS website and have a look at the DHS Recodes, and look at the recodes that relate to the surveys. The other alternative is to download all the surveys and then query the variables within them. This is what we’ll demonstrate here as it also demonstrates more of the package’s functionality:

So first we will download all these datasets:

# download datasets
downloads <- client$get_datasets(datasets$FileName)

The function returns a list with a file path to where the downloaded datasets have been saved to. By default the files will download quietly, i.e. no progress is shown. However, if you want to see the progress then you can control this by setting this in your config using the verbose_download argument.

4. Load datasets and associated metadata into R.

We can now examine what it is we have actually downloaded, by reading in one of these datasets:

# read in our dataset
cdpr <- readRDS(downloads$CDPR61FL)

The dataset returned here contains all the survey questions within the dataset, and what their survey variables are:

# let's look at the variable_names
head(get_variable_labels(cdpr))

##   variable                                                  description
## 1     hhid                                          Case Identification
## 2    hvidx                                                  Line number
## 3    hv000                                       Country code and phase
## 4    hv001                                               Cluster number
## 5    hv002                                             Household number
## 6    hv003 Respondent's line number (answering Household questionnaire)

# and then the dataset
class(cdpr$hv024)

## [1] "labelled"

This is the default behaviour for the client function get_datasets - it will download the datasets for you, and then by default save them in your client’s root directory and then unzip them and read them in for you, and save the resultant data.frame as a .rds object within the client’s root directory. You can control this behaviour using the download_option argument as such:

client$get_datasets(download_option = "zip") - Just the downloaded zip will be saved
client$get_datasets(download_option = "rds") - Just the read in rds will be saved
client$get_datasets(download_option = "both") - The zip is downloaded and saved as well as the read in rds

The other main reason for reading the dataset in straight away as the default option is that the created table of all the survey variables and their definitions is cached then and there, which then allows us to quickly query for particular search terms or survey variables:

# rapid diagnostic test search
questions <- client$survey_questions(datasets$FileName, search_terms = "malaria rapid test")

table(questions$dataset_filename)

## 
## CDHR61FL CDPR61FL TZHR7AFL TZPR7AFL 
##       24        1       48        1

What we see from the questions is that the question “Result of malaria rapid test” appears in a few different datasets. This is because the household member recode datasets (CDPR61SV, TZPR7HSV) stores information about the children in a household, with one row per child, whereas the household recode (CDHR61SV, TZHR7HSV) stores information about the household, and thus flattens the information from each child into different subvariables (hml35$01/02 etc). As such it is easier to extract this information from the household member recodes.

5. Extract variables and combining datasets for pooled multi-survey analyses.

To extract our data we pass our questions object to the client function extract, which will create a list with each dataset and its extracted data as a data.frame. We also have the option to add any geographic data available, which will download the geographic data files for you and add this data to you resultant extract:

# let's just use the PR files thus
datasets <- dhs_datasets(surveyIds = survs$SurveyId, fileFormat = "FL", fileType = "PR")
downloads <- client$get_datasets(datasets$FileName)

# and grab the questions from this again along with also questions detailing the province
questions <- client$survey_questions(datasets$FileName, search_terms = c("malaria rapid test"))

# and now extract the data
extract <- client$extract(questions, add_geo = FALSE)

# what does our extract look like
str(extract)

## List of 2
##  $ CDPR61FL:Classes 'dhs_dataset' and 'data.frame':  95949 obs. of  2 variables:
##   ..$ hml35   : 'labelled' int [1:95949] NA NA NA NA NA NA NA 1 0 NA ...
##   .. ..- attr(*, "label")= chr "Result of malaria rapid test"
##   .. ..- attr(*, "labels")= Named int [1:3] 0 1 9
##   .. .. ..- attr(*, "names")= chr [1:3] "negative" "positive" "missing"
##   ..$ SurveyId: chr [1:95949] "CD2013DHS" "CD2013DHS" "CD2013DHS" "CD2013DHS" ...
##  $ TZPR7AFL:Classes 'dhs_dataset' and 'data.frame':  64880 obs. of  2 variables:
##   ..$ hml35   : 'labelled' int [1:64880] NA NA NA NA NA NA NA 0 NA NA ...
##   .. ..- attr(*, "label")= chr "Result of malaria rapid test"
##   .. ..- attr(*, "labels")= Named int [1:3] 0 1 9
##   .. .. ..- attr(*, "names")= chr [1:3] "negative" "positive" "missing"
##   ..$ SurveyId: chr [1:64880] "TZ2015DHS" "TZ2015DHS" "TZ2015DHS" "TZ2015DHS" ...

The resultant extract is a list, with a new element for each different dataset that you have extracted. The responses from the dataset are by default stored as a labelled class from the haven package. This class preserves the original semantics and can easily be coerced to factors with haven::as_factor(). Special missing values are also preserved. For more info on the labelled class have a look at their github.

We can also query our datasets for the survey question variables. In the example above the survey question was Result of malaria rapid test and the variable was hml35. So if you knew the survey variables that you wanted (either by looking at the Recode file or by looking through the variable_names included in the datasets) then we could search against these. So let’s grab the regions using hv024 using the client function survey_variables():

# and grab the questions from this now utilising the survey variables
questions <- client$survey_variables(datasets$FileName, variables = c("hv024","hml35"))

# and now extract the data
extract2 <- client$extract(questions, add_geo = FALSE)

# quick check 
head(extract2$CDPR61FL)

##   hv024 hml35  SurveyId
## 1     4    NA CD2013DHS
## 2     4    NA CD2013DHS
## 3     4    NA CD2013DHS
## 4     4    NA CD2013DHS
## 5     4    NA CD2013DHS
## 6     4    NA CD2013DHS

head(extract2$TZPR7HFL)

## NULL

# and just to prove that hml35 did actually read in okay (there are just lots of NA)
table(extract2$CDPR61FL$hml35,useNA = "always")

## 
##     0     1     9  <NA> 
##  5260  2959     8 87722

We can now combine our two dataframes for further analysis using the rdhs package function rbind_labelled(). This function works specifically with our lists of labelled data.frames:

# first let's bind our first extraction, without the hv024
extract_bound <- rbind_labelled(extract)

head(extract_bound)

##            hml35  SurveyId  DATASET
## CDPR61FL.1    NA CD2013DHS CDPR61FL
## CDPR61FL.2    NA CD2013DHS CDPR61FL
## CDPR61FL.3    NA CD2013DHS CDPR61FL
## CDPR61FL.4    NA CD2013DHS CDPR61FL
## CDPR61FL.5    NA CD2013DHS CDPR61FL
## CDPR61FL.6    NA CD2013DHS CDPR61FL

# now let's try our second extraction
extract2_bound <- rbind_labelled(extract2)

## Warning in rbind_labelled(extract2): Some variables have non-matching value labels: hv024.
## Inheriting labels from first data frame with labels.

This hasn’t quite done what we might want in the second instance. The hv024 variable stores the regions for these 2 countries, which will not be the same and thus the labels will be different between the two of them. Without specifying any additional arguments rbind_labelled() will simply use the first data.frames labelling as the default, which will mean that some of the Tanzanian provinces will have been encoded as DRC provinces - not good! (This is a similar problem in nature to say trying to add new character strings to a factored data.frame).

There are a few work arounds. Firstly, we can specify a labels argument to the function which will detail how we should handle different variables. labels is a names list that specifies how to handle each variable. If we simply want to keep all the labels then we us the string “concatenate”:

# lets try concatenating the hv024
better_bound <- rbind_labelled(extract2, labels = list("hv024"="concatenate"))

head(better_bound$hv024)

## <Labelled integer>
## [1] 6 6 6 6 6 6
## 
## Labels:
##  value            label
##      1           arusha
##      2         bandundu
##      3        bas-congo
##      4    dar es salaam
##      5           dodoma
##      6         equateur
##      7            geita
##      8           iringa
##      9           kagera
##     10 kasai-occidental
##     11   kasai-oriental
##     12  kaskazini pemba
##     13 kaskazini unguja
##     14          katanga
##     15           katavi
##     16           kigoma
##     17      kilimanjaro
##     18         kinshasa
##     19     kusini pemba
##     20    kusini unguja
##     21            lindi
##     22          maniema
##     23          manyara
##     24             mara
##     25            mbeya
##     26  mjini magharibi
##     27         morogoro
##     28           mtwara
##     29           mwanza
##     30           njombe
##     31        nord-kivu
##     32        orientale
##     33            pwani
##     34            rukwa
##     35           ruvuma
##     36        shinyanga
##     37           simiyu
##     38          singida
##     39         sud-kivu
##     40           tabora
##     41            tanga

We could also specify new labels for a variable. For example, imagine the two datasets encoded their RDT responses differently, with the first one as c("No","Yes") and the other as c("Negative","Positive"). These would be for our purposes the same response, and so we could either leave it and all our results would use the c("No","Yes") labelling. But we may want to use the latter as it’s more informative/correct, or we may want to be crystal clear and use c("NegativeTest","PositiveTest"). we can do that like this:

# lets try concatenating the hv024 and providing new labels
better_bound <- rbind_labelled(extract2,
                               labels = list("hv024"="concatenate",
                                             "hml35"=c("NegativeTest"=0, "PositiveTest"=1)))

# and our new label
head(better_bound$hml35)

## <Labelled integer>
## [1] NA NA NA NA NA NA
## 
## Labels:
##  value        label
##      0 NegativeTest
##      1 PositiveTest

The other option is to not use the labelled class at all. We can control this when we download our datasets, using the argument reformat=TRUE. This will ensure that no factors or labels are used and it is just the raw data. When this option is set the object returned by client$get_datasets() no longer has any labelled classes or factors. However, we can still recover the variable table for a dataset using get_variable_labels(), which will take any dataset output by get_datasets() and return a data.frame describing the survey question variables and definitions.

# download the datasets with the reformat arguments
downloads <- client$get_datasets(datasets$FileName, reformat=TRUE)

# grab the questions but specifying the reformat argument
questions <- client$survey_variables(datasets$FileName, variables = c("hv024", "hml35"),
                                     reformat=TRUE)

# and now extract the data
extract3 <- client$extract(questions, add_geo = FALSE)

# group our results
bound_no_labels <- rbind_labelled(extract3)

# what does our hv024 look like now
class(bound_no_labels$hv024[1])

## [1] "character"

The hv024 column is now just characters, which is possibly the best option depending on your downstream analysis/preferences. It’s for this reason that the geographic data that is added is never turned into factors or labels.

Lastly, we can now use our extract dataset to carry out some regression analysis, to investigate the relationship between malaria prevalence and the quality of wall materials. To do this we will need to first grab the sample weights and stratification from the surveys, along with the extra variables and we will then check the RDT prevalence calculated using the raw data versus the API:

# grab the additional variable hv023 and hv024 which have the strata and weights respectively
questions <- client$survey_variables(datasets$FileName,variables = c("hv005","hv021","hv022","hv023","hv024",
                                                              "hv025","hv214","hml20", "hc1","hml35"))
extraction <- client$extract(questions,TRUE)

# now concatenate the provinces as before and remove missing responses
dat <- rbind_labelled(extraction,labels=list("hv024"="concatenate","hv214"="concatenate"))
dat <- dat[-which(dat$hml35==9),] # remove missing responses

# and we are going to compare our extract to the API malaria prevalence by RDT, which is for those between 6 and 59 months
dat <- dat[-which(!dat$hc1>6 & dat$hc1<=60),]

# create a denominator response for hml35
dat$hml35denom <- as.integer(!is.na(dat$hml35))
dat$bricks <- dat$hv214 %in% c(8,18,5,9,10)
dat$net <- as.logical(dat$hml20)

# specify the strata and sample weights
dat$strata <- paste0(dat$hv023,dat$DATASET)
dat$hv005 <- dat$hv005/1e6

# construct a survey design using the survey pacakge
library(survey)

# construct the sample design and calculate the mean and totals 
des <-  survey::svydesign(~CLUSTER+DATASET,data=dat,weight=~hv005)
results <- cbind(survey::svyby(~hml35,by=~DHSREGNA+DATASET, des, survey::svyciprop,na.rm=TRUE),
                 survey::svyby(~hml35denom,by=~DHSREGNA+DATASET, des, survey::svytotal,na.rm=TRUE))
results <- results[order(results$DATASET),]

# grab the same data from the API 
dhs_api_data <- dhs_data(countryIds = c("CD","TZ"),indicatorIds = "ML_PMAL_C_RDT",breakdown = "subnational",surveyYearStart = 2013, surveyYearEnd = 2016)
dhs_api_data <- cbind(dhs_api_data$Value,dhs_api_data$DenominatorWeighted,dhs_api_data$CharacteristicLabel, dhs_api_data$SurveyId)
api <- dhs_api_data[!grepl("\\.\\.",dhs_api_data[,3]),] # remove subregions included in Tanzania
api <- api[order(apply(api[,4:3],1,paste,collapse="")),]

# bind the results and remove duplicate Region Columns
comparison <- cbind(results[,c(1,3,7)],api[])
names(comparison) <- c("Region","Survey_RDT_Prev","Survey_RDT_Denom","API_RDT_Prev","API_RDT_Denom","API_Regions","SurveyID")

head(comparison[,c(1,2,4,3,5,7)])

##                                     Region Survey_RDT_Prev API_RDT_Prev
## Bandundu.CDPR61FL                 Bandundu       0.2038607         20.2
## Bas-Congo.CDPR61FL               Bas-Congo       0.4761599         47.1
## Equateur.CDPR61FL                 Equateur       0.2732268         27.4
## Kasai-Occidental.CDPR61FL Kasai-Occidental       0.4507341         44.5
## Kasai-Oriental.CDPR61FL     Kasai-Oriental       0.4923113         49.4
## Katanga.CDPR61FL                   Katanga       0.3989890         38.9
##                           Survey_RDT_Denom API_RDT_Denom  SurveyID
## Bandundu.CDPR61FL                1415.6056          1414 CD2013DHS
## Bas-Congo.CDPR61FL                342.6473           347 CD2013DHS
## Equateur.CDPR61FL                1267.4746          1236 CD2013DHS
## Kasai-Occidental.CDPR61FL         604.6050           612 CD2013DHS
## Kasai-Oriental.CDPR61FL           892.6006           894 CD2013DHS
## Katanga.CDPR61FL                  861.2030           844 CD2013DHS

It’s a little off, which will be due to the specific stratification the DHS Program will have used, as well as potentially how they have grouped the primary sampling units. We are hoping to get this information from the DHS for each survey so we can make this process more streamline again.

And lastly we will construct a logistic regression to investigate the relationship between a positive malaria RDT and whether the main walls of an individual’s house were made of bricks or similar, while adjusting for urban/rural (v025) and fixed effects for each survey.

# contsruct our glm using svyglm and specify quasibinomial to handle the na in hml35
summary(svyglm(hml35 ~ DATASET + hv025 + net + bricks, des, family="quasibinomial"))

## 
## Call:
## svyglm(formula = hml35 ~ DATASET + hv025 + net + bricks, des, 
##     family = "quasibinomial")
## 
## Survey design:
## survey::svydesign(~CLUSTER + DATASET, data = dat, weight = ~hv005)
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -1.65528    0.23267  -7.114 3.20e-12 ***
## DATASETTZPR7AFL -0.95553    0.12901  -7.406 4.39e-13 ***
## hv025            0.52558    0.13009   4.040 6.03e-05 ***
## netTRUE          0.06529    0.07163   0.911    0.362    
## bricksTRUE      -0.72109    0.13517  -5.335 1.36e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasibinomial family taken to be 0.9563804)
## 
## Number of Fisher Scoring iterations: 4

What we can see is that a significant negative gradient was associated with walls being made of bricks or similarly good materials in comparison to malaria positivity rates by RDT. What is also interesting is that whether the individual slept under a long lasting insecticidal net (hml20 that we converted to net) was not significant.

Summary and further thoughts

Hopefully the above tutorial has shown how the rdhs package can facilitate both querying the DHS API and hopefully make downloading and interacting with the raw datasets a smoother, more reproducible process. It is worth bearing in mind though, that creating a harmonised dataset is not always as easy as the example above - a lot of the time survey variables differ across years and surveys, which is hopefully when the survey_questions functionality will make it easier to first filter down to those that include the relevant questions before having to decide which survey questions are valid.

Any suggestions or comments/corrections/errors/ideas please let me know either in the issues or send me an email at “o.watson15@imperial.ac.uk”. And if there is any further functionality that you think you would be useful, then also let me know. :)

OJ Watson, Jeff Eaton

2018-09-24