Formatting Migration Recall Data for Longitudinal Analysis

Migration Data Discovery Data Manipulation pivot_longer regex

Use tidyr::pivot_longer to reshape wide data into a long format.

Matt Gunther (IPUMS PMA Senior Data Analyst)
04-15-2021

Most of the data you’ll find at IPUMS PMA comes from cross-sectional surveys, where each respondent is interviewed only once. However, there are some items on the Female Questionnaire that ask respondents to recall past events. When these recall data are linked to a measure of time, the data can be restructured to simulate longitudinal data – or repeated observations on individuals over time. Once the data are in this structure, we can use a range of analytic tools to determine how the frequency or duration of past experiences explains current outcomes.

One place where you’ll find this type of data is on the migration variables page. In our last post, we saw that PMA samples from Ethiopia began including information about each respondent’s single most most recent migration experience beginning with the 2016 sample. More recently, a number of samples from other countries have collected data about each respondent’s complete migration history, organized in chronological order. These samples include:

In this post, we’ll take a look at the available information in the migration data collected from these newer samples. As we’ll see, female respondents who indicate that they have migrated at least once receive the same set of questions for each place they have lived, resulting in a dataset that is exceptionally wide and cumbersome to use in most time-dependent applications. We’ll show how to reshape an example data extract into a much friendlier long format using the function tidyr::pivot_longer.

Data Availability

The samples listed above contain data from nearly identical survey questions. In the interview, a respondent is first asked how long they have lived in their current place of residence; if they indicate that they have not “always” lived in the same place, they are then asked how many places they’ve lived for more than six months after age 15 or their first marriage (whichever happened first).

Respondents who list at least one such place are then asked to recall information about each place, starting with the place before their current residence. Information about the most recent place is represented by variables beginning with the prefix PLACE1. Information about the second most recent place is represented by the prefix PLACE2, and so forth.

In each of these samples, the same questions are repeated for each place until all of the respondent’s previous places of residence are fully enumerated. The available information about each place includes:

Additionally, the respondent could list multiple reasons for migrating to each place (note that this information is not available for the respondent’s migration to their current place of residence). Options include:

Because the respondent can choose multiple reasons, we’ll find one binary indicator for each of these 18 reasons. As you might imagine, this results in a very wide dataset! Some respondents move as many as 11 times, resulting in 198 columns from just this one repeated multiple-response question.

The wide shape of these data is more than an inconvenience: most longitudinal analysis applications require easy access to the time interval between events. In their current format, each numbered PLACE variable represents a single migration event, but it’s difficult to tell how much time passed between any two migrations for a given person. To make this kind of analysis possible, we need to pivot the migration variables into a long format accompanied by a new variable showing the time interval between each migration.

Longitudinal Data Structures

Let’s take a look at the way migration history variables are currently formatted for one of the samples we discussed above. For this example, we’ve created a data extract containing all of the available migration data for the Kenya 2019 sample (female respondents only). We’ll load it and the following packages into R:

library(tidyverse)
library(ipumsr)

dat <- read_ipums_micro(
  ddi = "data/pma_00022.xml",
  data = "data/pma_00022.dat.gz"
)

Like all IPUMS PMA data extracts, this dataset reflects a cross-sectional survey design where every response from each person is stored in a single row. If you’re familiar with longitudinal data structures, you know that repeated observations from the same respondents are stored in separate rows, where each row represents a different moment in time. Why might this be the case?

As we’ll see in our own data, the values for repeated observations and the amount of time that passes between observations are related only by a common pattern in the names for each variable when they’re stored together in a wide format. In our case, the only mark of time between migrations is the respondent’s age. Consider the following respondents, who have each migrated at least twice:

dat %>% 
  filter(PLACELIVENUM %in% 2:7) %>% 
  select(ends_with("DISTRICTKE"), ends_with("MOVEAGE")) %>% 
  relocate(starts_with("PLACE1"), starts_with("PLACE2"))
# A tibble: 1,206 x 14
  PLACE1DISTRICTKE PLACE1MOVEAGE PLACE2DISTRICTKE PLACE2MOVEAGE
         <int+lbl>     <int+lbl>        <int+lbl>     <int+lbl>
1  7 [Nandi]                  29    32 [Migori]              19
2  8 [Nyamira]                19     8 [Nyamira]             21
3  6 [Nairobi]                34     3 [Kiambu]              28
4  6 [Nairobi]                25    35 [Nakuru]               0
5 43 [Trans-Nzoia]            15    28 [Machakos]             0
# … with 1,201 more rows, and 10 more variables:
#   PLACE3DISTRICTKE <int+lbl>, PLACE4DISTRICTKE <int+lbl>,
#   PLACE5DISTRICTKE <int+lbl>, PLACE6DISTRICTKE <int+lbl>,
#   PLACE7DISTRICTKE <int+lbl>, PLACE3MOVEAGE <int+lbl>,
#   PLACE4MOVEAGE <int+lbl>, PLACE5MOVEAGE <int+lbl>,
#   PLACE6MOVEAGE <int+lbl>, PLACE7MOVEAGE <int+lbl>

PLACE1DISTRICTKE shows the administrative district of the last place a respondent lived before their current place of residence, and PLACE1MOVEAGE shows her age when she moved there. PLACE2DISTRICTKE shows the district of the second most recent place she lived, and PLACE2MOVEAGE shows her age when she moved there. The same pattern would be repeated for all of the places a person might have lived (in this sample, some respondents migrated as many as 7 times).

Suppose you wanted to know something very simple about the relationship between these variables, such as the average age of female migrants arriving at each district in the sample. In this wide format, you would first have to find the mean PLACE1MOVEAGE for every district in PLACE1DISTRICTKE, then the mean PLACE2MOVEAGE for every district in PLACE2DISTRICTKE, and so forth for all 7 places. Then, you’d need to find the frequency weighted mean for each district in all 7 places. That’s quite a bit of extra work for just one simple statistic!

Imagine you wanted to model the effect of a time-dependent variable on an outcome of interest. For example, you might suppose that the number of times a female respondent gives birth could be related to the length of time she’s lived in a district where there are relatively few family planning services available. As you can see, we’d have a hard time building an appropriate model because the relevant data are currently strewn across 14 different variables. Instead, we’d much prefer two work with the data in a long format with only two variables: one representing DISTRICTKE and one representing MOVEAGE.

Pivot Longer into One Column

For the moment, let’s continue working just with the district for each place in an individual’s migration history. To keep things simple, we’ll create a dataset called district containing only the variables ending with the string DISTRICTKE.

district <- dat %>% select(ends_with("DISTRICTKE")) 

district
# A tibble: 9,549 x 7
   PLACE1DISTRICTKE PLACE2DISTRICTKE PLACE3DISTRICTKE PLACE4DISTRICTKE
          <int+lbl>        <int+lbl>        <int+lbl>        <int+lbl>
1 99 [NIU (not in … 99 [NIU (not in… 99 [NIU (not in… 99 [NIU (not in…
2  6 [Nairobi]      99 [NIU (not in… 99 [NIU (not in… 99 [NIU (not in…
3 99 [NIU (not in … 99 [NIU (not in… 99 [NIU (not in… 99 [NIU (not in…
4 99 [NIU (not in … 99 [NIU (not in… 99 [NIU (not in… 99 [NIU (not in…
5 99 [NIU (not in … 99 [NIU (not in… 99 [NIU (not in… 99 [NIU (not in…
# … with 9,544 more rows, and 3 more variables:
#   PLACE5DISTRICTKE <int+lbl>, PLACE6DISTRICTKE <int+lbl>,
#   PLACE7DISTRICTKE <int+lbl>

As you can see, a lot of the information contained in district isn’t really necessary. Every row holds information on 7 places of residence, but most respondents migrated only to 1 or 2 places if they ever migrated at all. The best approach here is to tell R that labels like NIU (not in universe) and No response or missing each represent a type of missing data that we can recode as NA. We can do that with help from ipumsr::lbl_na_if and dplyr::across:

district <- district %>% 
  mutate(across(everything(), ~{
    lbl_na_if(.x, ~.lbl %in% c(
      "No response or missing",
      "NIU (not in universe)"
    ))
  }))

district
# A tibble: 9,549 x 7
  PLACE1DISTRICTKE PLACE2DISTRICTKE PLACE3DISTRICTKE PLACE4DISTRICTKE
         <int+lbl>        <int+lbl>        <int+lbl>        <int+lbl>
1     NA                         NA               NA               NA
2      6 [Nairobi]               NA               NA               NA
3     NA                         NA               NA               NA
4     NA                         NA               NA               NA
5     NA                         NA               NA               NA
# … with 9,544 more rows, and 3 more variables:
#   PLACE5DISTRICTKE <int+lbl>, PLACE6DISTRICTKE <int+lbl>,
#   PLACE7DISTRICTKE <int+lbl>

In order to keep track of individuals, let’s also add a column ID that represents a short identification number for each person:

district <- district %>% rowid_to_column("ID")
district
# A tibble: 9,549 x 8
     ID PLACE1DISTRICTKE PLACE2DISTRICTKE PLACE3DISTRICTKE
  <int>        <int+lbl>        <int+lbl>        <int+lbl>
1     1     NA                         NA               NA
2     2      6 [Nairobi]               NA               NA
3     3     NA                         NA               NA
4     4     NA                         NA               NA
5     5     NA                         NA               NA
# … with 9,544 more rows, and 4 more variables:
#   PLACE4DISTRICTKE <int+lbl>, PLACE5DISTRICTKE <int+lbl>,
#   PLACE6DISTRICTKE <int+lbl>, PLACE7DISTRICTKE <int+lbl>

Now, we’re ready to use the function tidyr::pivot_longer to reshape district. The simplest way to use this function is to specify a group of columns with the argument cols:

district %>% pivot_longer(cols = ends_with("DISTRICTKE"))
# A tibble: 66,843 x 3
      ID name                    value
   <int> <chr>               <int+lbl>
 1     1 PLACE1DISTRICTKE NA          
 2     1 PLACE2DISTRICTKE NA          
 3     1 PLACE3DISTRICTKE NA          
 4     1 PLACE4DISTRICTKE NA          
 5     1 PLACE5DISTRICTKE NA          
 6     1 PLACE6DISTRICTKE NA          
 7     1 PLACE7DISTRICTKE NA          
 8     2 PLACE1DISTRICTKE  6 [Nairobi]
 9     2 PLACE2DISTRICTKE NA          
10     2 PLACE3DISTRICTKE NA          
# … with 66,833 more rows

By default, the name of each column moves into a single column called name, and the value of each column moves into an adjascent column called value. We can manually change the names of these columns with the arguments names_to and values_to:

district %>% pivot_longer(
  cols = ends_with("DISTRICTKE"),
  names_to = "PLACE",
  values_to = "DISTRICTKE"
)
# A tibble: 66,843 x 3
      ID PLACE              DISTRICTKE
   <int> <chr>               <int+lbl>
 1     1 PLACE1DISTRICTKE NA          
 2     1 PLACE2DISTRICTKE NA          
 3     1 PLACE3DISTRICTKE NA          
 4     1 PLACE4DISTRICTKE NA          
 5     1 PLACE5DISTRICTKE NA          
 6     1 PLACE6DISTRICTKE NA          
 7     1 PLACE7DISTRICTKE NA          
 8     2 PLACE1DISTRICTKE  6 [Nairobi]
 9     2 PLACE2DISTRICTKE NA          
10     2 PLACE3DISTRICTKE NA          
# … with 66,833 more rows

Even more conveniently, we can generate these columns automatically if we identify a pattern in the original column names. This approach efficiently handles both the names of the new columns and the values stored in each column. Notice that we’ve manually specified the name of the column DISTRICTKE above; this is fine if we’re only pivoting one column, but we want to avoid manually writing a name for each new column when we start working with several variables at once. Also, notice the values that appear in the PLACE column; wouldn’t it be more convenient to extract the index number for each place, rather than the full names of the original columns?

We’ll use the additional argument names_pattern to solve both problems at once. Any string enclosed with parentheses () in names_pattern can be passed, in sequential order, to names_to. In this example, we specify a pattern where the string PLACE will be followed by a single-digit number ([0-9]) followed by the string (DISTRICTKE). The argument names_to places the single-digit number in a column called PLACE, and it places the string DISTRICTKE in a column that uses a pronoun .value to represent the contents of the string.

district %>% 
  pivot_longer(
    cols = ends_with("DISTRICTKE"), 
    names_pattern = "PLACE([0-9])(DISTRICTKE)",
    names_to = c("PLACE", ".value")
  ) 
# A tibble: 66,843 x 3
      ID PLACE   DISTRICTKE
   <int> <chr>    <int+lbl>
 1     1 1     NA          
 2     1 2     NA          
 3     1 3     NA          
 4     1 4     NA          
 5     1 5     NA          
 6     1 6     NA          
 7     1 7     NA          
 8     2 1      6 [Nairobi]
 9     2 2     NA          
10     2 3     NA          
# … with 66,833 more rows

We can improve the scalability of this code just a little bit more by adding the wildcard . in names_pattern to represent “any character” and the operator * to represent “any number of times”. This allows us to write ([0-9]*) to find an integer of any length (in case some respondents move 10 places or more), and (.*) to find a string of any length afterward (this saves us the trouble of writing “DISTRICTKE”).

district %>% 
  pivot_longer(
    cols = ends_with("DISTRICTKE"), 
    names_pattern = "PLACE([0-9]*)(.*)",
    names_to = c("PLACE", ".value")
  ) 
# A tibble: 66,843 x 3
      ID PLACE   DISTRICTKE
   <int> <chr>    <int+lbl>
 1     1 1     NA          
 2     1 2     NA          
 3     1 3     NA          
 4     1 4     NA          
 5     1 5     NA          
 6     1 6     NA          
 7     1 7     NA          
 8     2 1      6 [Nairobi]
 9     2 2     NA          
10     2 3     NA          
# … with 66,833 more rows

Notice that there are now 66,843 rows in district: that’s 7 rows for 7 places per respondent. Adding the argument values_drop_NA = TRUE drops placeholder values for respondents who lived in fewer than 7 places:

district %>% 
  pivot_longer(
    cols = ends_with("DISTRICTKE"), 
    names_pattern = "PLACE([0-9]*)(.*)",
    names_to = c("PLACE", ".value"),
    values_drop_na = TRUE
  ) 
# A tibble: 5,165 x 3
      ID PLACE    DISTRICTKE
   <int> <chr>     <int+lbl>
 1     2 1      6 [Nairobi] 
 2    11 1     10 [Kakamega]
 3    12 1     10 [Kakamega]
 4    14 1      7 [Nandi]   
 5    14 2     32 [Migori]  
 6    16 1      6 [Nairobi] 
 7    19 1     33 [Mombasa] 
 8    20 1      8 [Nyamira] 
 9    20 2      8 [Nyamira] 
10    21 1      6 [Nairobi] 
# … with 5,155 more rows

This step causes any respondent who has never migrated from a place they lived for 6 months or more after age 15 / first marriage to be filtered out of the data. Here, we see the first 10 rows from all of the remaining female respondents. Individuals 14 and 20 lived in two such places: individual 14 first lived in the district Migori, then moved to Nandi, and finally moved to her current residence (not shown). Individual 20 first lived in Nyamira, then moved to another place also in Nyamira, and finally moved to her current residence (not shown). All of the other displayed respondents lived in exactly one such place. Next, we’ll add the age at which each of the women moved to each location.

Pivot Longer into Multiple Columns

Now that we know how to use wildcard operators in pivot_longer, we’ll be able to start pivoting multiple columns at once. Let’s start by adding the respondent’s age when they moved to each place. Using the same processing steps we used to make district, we’ll create a new dataset called age.

age <- dat %>% 
  select(ends_with("DISTRICTKE"), ends_with("MOVEAGE")) %>% 
  mutate(across(everything(), ~{
    lbl_na_if(.x, ~.lbl %in% c(
      "No response or missing",
      "NIU (not in universe)"
    ))
  })) %>% 
  rowid_to_column("ID")
age %>% relocate(ID, starts_with("PLACE1"), starts_with("PLACE2"))
# A tibble: 9,549 x 15
      ID PLACE1DISTRICTKE PLACE1MOVEAGE PLACE2DISTRICTKE PLACE2MOVEAGE
   <int>        <int+lbl>     <int+lbl>        <int+lbl>     <int+lbl>
 1     1     NA                      NA               NA            NA
 2     2      6 [Nairobi]            16               NA            NA
 3     3     NA                      NA               NA            NA
 4     4     NA                      NA               NA            NA
 5     5     NA                      NA               NA            NA
 6     6     NA                      NA               NA            NA
 7     7     NA                      NA               NA            NA
 8     8     NA                      NA               NA            NA
 9     9     NA                      NA               NA            NA
10    10     NA                      NA               NA            NA
# … with 9,539 more rows, and 10 more variables:
#   PLACE3DISTRICTKE <int+lbl>, PLACE4DISTRICTKE <int+lbl>,
#   PLACE5DISTRICTKE <int+lbl>, PLACE6DISTRICTKE <int+lbl>,
#   PLACE7DISTRICTKE <int+lbl>, PLACE3MOVEAGE <int+lbl>,
#   PLACE4MOVEAGE <int+lbl>, PLACE5MOVEAGE <int+lbl>,
#   PLACE6MOVEAGE <int+lbl>, PLACE7MOVEAGE <int+lbl>

Because we’re using the wildcard pattern (.*), the function will treat the string MOVEEAGE the same way it treats DISTRICTKE. We only need to add the new columns to cols:

age <- age %>% 
  pivot_longer(
    cols = c(ends_with("DISTRICTKE"), ends_with("MOVEAGE")), 
    names_pattern = "PLACE([0-9]*)(.*)",
    names_to = c("PLACE", ".value"),
    values_drop_na = TRUE
  ) 

age
# A tibble: 5,246 x 4
      ID PLACE    DISTRICTKE   MOVEAGE
   <int> <chr>     <int+lbl> <int+lbl>
 1     2 1      6 [Nairobi]         16
 2    11 1     10 [Kakamega]        17
 3    12 1     10 [Kakamega]        16
 4    14 1      7 [Nandi]           29
 5    14 2     32 [Migori]          19
 6    16 1      6 [Nairobi]         21
 7    19 1     33 [Mombasa]         31
 8    20 1      8 [Nyamira]         19
 9    20 2      8 [Nyamira]         21
10    21 1      6 [Nairobi]         34
# … with 5,236 more rows

The advantages we’ve gained with a longer data format are starting to become clear! Suppose you wanted to know the average age of migrants arriving at each of Kenya’s administrative districts in this sample. You could find this information easily with just one summarise function:

age %>% 
  group_by(DISTRICTKE) %>% 
  summarise(MEAN_AGE = mean(MOVEAGE, na.rm = T))
# A tibble: 48 x 2
      DISTRICTKE MEAN_AGE
       <int+lbl>    <dbl>
 1  1 [Bungoma]      18.5
 2  2 [Kericho]      18.0
 3  3 [Kiambu]       20.7
 4  4 [Kilifi]       17.1
 5  5 [Kitui]        19.3
 6  6 [Nairobi]      20.0
 7  7 [Nandi]        18.4
 8  8 [Nyamira]      17.9
 9  9 [Siaya]        16.1
10 10 [Kakamega]     17.1
# … with 38 more rows

Let’s now pivot all of the migration history columns in our original dataset dat. This time, we’ll specify that all of the desired cols start with the same prefix PLACE (but we’ll drop the column PLACELIVENUM, since it contains the string “PLACE” we’re using in names_pattern):

dat <- dat %>% 
  mutate(across(everything(), ~{
    lbl_na_if(.x, ~.lbl %in% c(
      "No response or missing",
      "NIU (not in universe)"
    ))
  })) %>% 
  rowid_to_column("ID") %>% 
  select(ID, starts_with("PLACE"), -PLACELIVENUM) %>% 
  pivot_longer(
    cols = starts_with("PLACE"), 
    names_pattern = "PLACE([0-9]*)(.*)",
    names_to = c("PLACE", ".value"),
    values_drop_na = TRUE
  )

dat
# A tibble: 5,251 x 24
      ID PLACE COUNTRY  DISTRICTKE MOVEAGE       UR YCHILDEDU YCOHABIT
   <int> <chr>   <int>   <int+lbl> <int+l> <int+lb> <int+lbl> <int+lb>
 1     2 1         404  6 [Nairob…      16 30 [Rur…    0 [No]   0 [No]
 2    11 1         404 10 [Kakame…      17 30 [Rur…    0 [No]   0 [No]
 3    12 1         404 10 [Kakame…      16 10 [Cit…    0 [No]   0 [No]
 4    14 1         404  7 [Nandi]       29 30 [Rur…    0 [No]   0 [No]
 5    14 2         404 32 [Migori]      19 10 [Cit…    0 [No]   0 [No]
 6    16 1         404  6 [Nairob…      21 30 [Rur…    0 [No]   0 [No]
 7    19 1         404 33 [Mombas…      31 20 [Per…    0 [No]   0 [No]
 8    20 1         404  8 [Nyamir…      19 10 [Cit…    0 [No]   0 [No]
 9    20 2         404  8 [Nyamir…      21 20 [Per…    0 [No]   0 [No]
10    21 1         404  6 [Nairob…      34 20 [Per…    0 [No]   0 [No]
# … with 5,241 more rows, and 16 more variables: YCONFLICT <int+lbl>,
#   YDIVORCE <int+lbl>, YFARM <int+lbl>, YHLTHACCESS <int+lbl>,
#   YHLTHPROB <int+lbl>, YJOBSEARCH <int+lbl>, YOTHER <int+lbl>,
#   YOTHERSOCIAL <int+lbl>, YPOSTMAR <int+lbl>,
#   YSCHOOLATTEND <int+lbl>, …

We’re left with a very manageable 24 migration history variables. Among these, all of the variables starting with Y indicate a possible reason “why” a respondent migrated to a particular PLACE. The simplest way to work with these Y variables is to use tidy selection functions, like starts_with("Y"). For example, suppose you wanted to know the percentage of all migrations in the sample that happened for all of the available reasons:

dat %>% 
  summarise(across(starts_with("Y"), ~100*mean(.x))) %>% 
  glimpse()
Rows: 1
Columns: 18
$ YCHILDEDU     <dbl> 1.980575
$ YCOHABIT      <dbl> 2.418587
$ YCONFLICT     <dbl> 8.931632
$ YDIVORCE      <dbl> 0.8188916
$ YFARM         <dbl> 4.05637
$ YHLTHACCESS   <dbl> 1.866311
$ YHLTHPROB     <dbl> 0.9331556
$ YJOBSEARCH    <dbl> 16.85393
$ YOTHER        <dbl> 9.502952
$ YOTHERSOCIAL  <dbl> 12.53095
$ YPOSTMAR      <dbl> 24.01447
$ YSCHOOLATTEND <dbl> 22.73853
$ YSCHOOLDONE   <dbl> 4.361074
$ YSICKREL      <dbl> 1.371167
$ YSPOUSEJOB    <dbl> 2.704247
$ YWKCHANGE     <dbl> 1.885355
$ YWKNONSEAS    <dbl> 3.77071
$ YWKSEASON     <dbl> 6.189297

Now that we’ve reshaped our migration recall data from a wide format to a long format, obtaining this summary data is a snap. And, as we’ll see in an upcoming migration Data Analysis post, using these data in longitudinal analysis can be just as easy.

Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.