Reading CDC mortality files using R
Reading fixed-width text files might be challenging, specially when we don’t have a dictionary file. In this post, I show steps to read CDC files in a more systematic way. In this example, I import a compress mortality file (CMF 1979-1988) available here and whose codebook (or layout) is here.
To read this file, usually with extension .txt
or .dat
, I first need to know where each column starts and finishes. What I get from the pdf file is something like this:
The layout is usually a codebook in Word/PDF or just plain text file. Here, I copy the PDF text and put it in a plain text file. I use a text editor (e.g., Sublime Text) and regular expressions to extract the information I need.
I have to select every row with this pattern: 1-2 2 FIPS State code Numeric
. That is, a number followed by a hyphen (although not always, particularly when the width of the column is one), spaces, another number, spaces, and then any text. I use the following regular expression to get that pattern: (^[0-9]+).([0-9]+)\s+([0-9])\s+(.+)
. Using the Sublime package Filter Lines I get something like this (you can also just copy the selected lines):
1-2 2 FIPS State code Numeric
3-5 FIPS county code Numeric
6-9 4 Year of death Numeric
11-12 2 Age at death Numeric
13-16 4 ICD code for underlying cause-of-death 3 digits: Numeric
17-19 3 Cause-of-Death Recode Numeric
20-23 4 Number of deaths Numeric
This approach might be particularly useful when you have a long PDF/Word file and you want to extract most of the variables. You would need to adapt the regular expressions I’m using to the particular patterns of your codebook.
To simplify, I format this text as a comma-separated values file (csv). Replacing this regular expression ([0-9]+)(-)([0-9]+)(\s)([0-9]+)(\s)(.+)(\s)(Numeric)
by \1,\3,\5,\7,\9
I get:
1,2,2,FIPS State code,Numeric
3,5,3,FIPS county code,Numeric
6,9,4,Year of death,Numeric
11,12,2,Age at death,Numeric
13,16,4,ICD code for underlying cause-of-death 3 digits:,Numeric
17,19,3,Cause-of-Death Recode,Numeric
20,23,4,Number of deaths,Numeric
Then, I read the layout file:
# define names of columns
colnames <- c("start", "end", "width", "name", "type")
dict <- read.csv("data/dictMortality.csv", col.names = colnames, header = FALSE)
## start end width name type
## 1 1 2 2 FIPS State code Numeric
## 2 3 5 3 FIPS county code Numeric
## 3 6 9 4 Year of death Numeric
## 4 11 12 2 Age at death Numeric
## 5 13 16 4 ICD code for underlying cause-of-death 3 digits: Numeric
## 6 17 19 3 Cause-of-Death Recode Numeric
## 7 20 23 4 Number of deaths Numeric
Now, I can read the fixed-width data file. I use the readr package (in my experience relatively fast for big datasets ~ 1 GB).
library(readr)
# create name of variables
cnames <- c("state", "county", "year", "age", "icd", "cause", "deaths")
# read mortality file
mort <- read_fwf("data/mort7988.txt", fwf_positions(dict$start, dict$end, cnames))
## # A tibble: 8,776,385 x 7
## state county year age icd cause deaths
## <chr> <chr> <int> <chr> <chr> <chr> <int>
## 1 01 001 1979 04 5789 780 1
## 2 01 001 1979 04 7980 770 1
## 3 01 001 1979 08 8121 800 1
## 4 01 001 1979 09 3439 780 1
## 5 01 001 1979 09 8120 800 2
## 6 01 001 1979 09 8189 800 1
## 7 01 001 1979 10 1629 180 1
## 8 01 001 1979 10 2396 250 1
## 9 01 001 1979 10 4289 410 1
## 10 01 001 1979 10 8070 810 1
## # ... with 8,776,375 more rows
# year distribution
table(mort$year)
## 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988
## 831605 854860 854198 850505 867280 875607 894176 905736 912551 929867
# number of deaths
sum(mort$deaths)
## [1] 20398153
Hopefully, this might save you some time!
Last Update: 06/29/2017