Maternal Mortality Ratio

The definition of maternal mortality is:

“A death of a woman while pregnant or within 42 days of termination of pregnancy, irrespective of the duration and site of pregnancy, from any cause related or aggravated by the pregnancy or its management, but not from accidental or incidental causes.”

The definition of maternal mortality ratio is:

“The number of maternal deaths during a given time period per 100,000 live births during the same time period.”

In this notebook we will create a long-term time-series of maternal mortality ratio. To do this we will combine four existing datasets:

We make the assumption that their methods are comparable and that the data can be combined without transformation.

In years where there is an overlap in both time-series we use the data from the WHO.

This document will go through the steps required to create the data for this chart:

Downloading the Gapminder data

Firstly we download the data from Gapminder and save it locally. Then we read in the Gapminder data, clean up the column names and take a look at the data.

Code
r = requests.get("https://www.gapminder.org/documentation/documentation/gapdata010.xls")
assert r.ok
Path("data/input/").mkdir(parents=True, exist_ok=True)
output = open("data/input/gapminder.xls", "wb")
output.write(r.content)
output.close()

Some preliminary cleaning of the data: 1) remove the first 16 rows, 2) remove white space from the country names and 3) remove rows where the maternal mortality rate is NA.

Code
df = pd.read_excel("data/input/gapminder.xls")
df_d = df.drop(df.index[0:16]).reset_index()
df_d["Country"] = df_d["Country"].str.strip()
df_d = df_d[~pd.isna(df_d["MMR"])]

Cleaning the Gapminder data

Important

There are some issues with the Gapminder data - we will manually clean them here.

  • The first three rows in the ‘year’ column seem to be wrong as it seems as if each row covers 110 years, but in another source they cover only 10 years.

  • I will replace them manually here with the middle of each decade, as is used in the other data from the same source.

  • There are cases where the year for Finland has been entered incorrectly three times, there are two 1772s, 1775s & 1967s, the second of each should be 1872, 1875 and 1957 respectively.

  • There are also two errors for New Zealand, and one error for both Sweden and the US.

  • Ireland, Denmark and Sri Lanka’s maternal mortality rate drops to zero at some points - we will remove these.

Code
df_d.loc[:2, "year"] = ["1875", "1885", "1895"]

df_d.loc[
    (df_d["year"] == 1772)
    & (df_d["Country"] == "Finland")
    & (df_d["Maternal deaths"] == 487),
    "year",
] = 1872
df_d.loc[
    (df_d["year"] == 1775)
    & (df_d["Country"] == "Finland")
    & (df_d["Maternal deaths"] == 629),
    "year",
] = 1875
df_d.loc[
    (df_d["year"] == 1967)
    & (df_d["Country"] == "Finland")
    & (df_d["Maternal deaths"] == 77),
    "year",
] = 1957
df_d.loc[
    (df_d["year"] == 1967)
    & (df_d["Country"] == "Sweden")
    & (df_d["Maternal deaths"] == 39),
    "year",
] = 1957
df_d.loc[
    (df_d["year"] == 1967)
    & (df_d["Country"] == "United States")
    & (df_d["Maternal deaths"] == 1766.28),
    "year",
] = 1957

# Drop any rows where MMR is zero or '...' (Ireland, Denmark and Sri Lanka)
df_d = df_d.drop(df_d[(df_d["MMR"] == 0) | (df_d["MMR"] == '...')].index)

df_d = df_d.drop(
    df_d[
        (df_d["Country"] == "New Zealand")
        & (df_d["year"] == 1950)
        & (df_d["MMR"] == 90)
    ].index
)


# Duplicates after get_mid_year()
df_d = df_d.drop(
    df_d[
        (df_d["year"].isin(["1989-02", "1899-03"])) & (df_d["Country"] == "New Zealand")
    ].index
)

Check that there are no duplicates in the data.

Code
assert (
    df_d[df_d.duplicated(subset = ["Country", "year"],keep=False)].shape[0]
    == 0
)

Checking the given MMR matches a calculated MMR, where the data for live births and maternal deaths is given. It seems largely okay but the values for Ireland, particularly from 1901-1920. There is a drop of almost 1000 maternal deaths between 1920 and 1922.

In the registrar general of Ireland it states that there were approximately 599 maternal deaths in Ireland in 1912 and on average 634 per year between 1903 and 1911. The estimated rate is 6 deaths per 1,000 births, so 600 per 100,000 births.

I think this suggests that the ‘maternal deaths’ figures may be wrong for the period 1901-1920. I think they may have back-calculated the maternal deaths from the MMR but accidentally used the year column instead as the calculated MMR works out as the year value.

Additionally there are some rows for the UK which don’t quite match, but they are very close so we will continue with the original values.

Code
df_check = df_d[['Country','year','Live Births', 'Maternal deaths', 'MMR']].dropna().reset_index(drop=True)
df_check['mmr_calc'] = ((df_check['Maternal deaths']/df_check['Live Births'])* 100000).astype(float).round(2)
df_check['MMR'] = df_check['MMR'].astype(float).round(2)


df_check[ df_check['mmr_calc'].round(2)!= df_check['MMR'].round(2)]
Country year Live Births Maternal deaths MMR mmr_calc
597 Ireland 1901 71827.1584 1365.434281 618.0 1901.00
598 Ireland 1902 72580.042 1380.472399 623.0 1902.00
599 Ireland 1903 72616.1667 1381.885652 563.0 1903.00
600 Ireland 1904 74020.0312 1409.341394 562.0 1904.00
601 Ireland 1905 73242.8892 1395.277039 557.0 1905.00
602 Ireland 1906 73539.1845 1401.656857 586.0 1906.00
603 Ireland 1907 72435.2952 1381.341079 496.0 1907.00
604 Ireland 1908 72697.7941 1387.073911 520.0 1908.00
605 Ireland 1909 73043.1 1394.392779 546.0 1909.00
606 Ireland 1910 72697.7941 1388.527867 531.0 1910.00
607 Ireland 1911 72319.7592 1382.030598 505.0 1911.00
608 Ireland 1912 71483.563 1366.765725 543.0 1912.00
609 Ireland 1913 70505.0556 1348.761714 536.0 1913.00
610 Ireland 1914 69693.6314 1333.936105 521.0 1914.00
611 Ireland 1915 66966.746 1282.413186 530.0 1915.00
612 Ireland 1916 63544.0465 1217.503931 574.0 1916.00
613 Ireland 1917 59895.5845 1148.198355 485.0 1917.00
614 Ireland 1918 60298.2468 1156.520374 483.0 1918.00
615 Ireland 1919 61931.92 1188.473545 471.0 1919.00
616 Ireland 1920 70748.4 1358.36928 487.0 1920.00
617 Ireland 1922 58849 370 566.0 628.73
618 Ireland 1923 61690 328 482.0 531.69
619 Ireland 1924 63402 330 478.0 520.49
620 Ireland 1925 62069 312 469.0 502.67
621 Ireland 1926 61176 329 489.0 537.79
622 Ireland 1927 60054 291 451.0 484.56
623 Ireland 1928 59176 318 493.0 537.38
624 Ireland 1929 58280 283 410.0 485.59
625 Ireland 1930 58333 294 477.0 504.00
626 Ireland 1931 57086 272 431.0 476.47
627 Ireland 1932 56240 280 455.0 497.87
628 Ireland 1933 57364 296 444.0 516.00
629 Ireland 1934 57897 304 468.0 525.07
630 Ireland 1935 58266 297 467.0 509.73
631 Ireland 1936 58115 299 470.0 514.50
632 Ireland 1937 56488 237 361.0 419.56
633 Ireland 1938 56925 267 411.0 469.04
634 Ireland 1939 56070 218 339.0 388.80
635 Ireland 1940 56394 227 367.0 402.53
636 Ireland 1941 56780 209 321.0 368.09
637 Ireland 1942 66117 189 247.0 285.86
638 Ireland 1943 64375 162 225.0 251.65
639 Ireland 1944 65425 176 238.0 269.01
640 Ireland 1945 66861 176 237.0 263.23
641 Ireland 1946 67922 163 201.0 239.98
642 Ireland 1947 68978 148 188.0 214.56
643 Ireland 1948 65930 124 158.0 188.08
644 Ireland 1949 64153 129 181.0 201.08
645 Ireland 1950 63565 99 156.0 155.75
1487 United Kingdom 1951 677529 526 76.0 77.64
1488 United Kingdom 1952 673735 463 67.0 68.72
1489 United Kingdom 1953 684372 495 71.0 72.33
1490 United Kingdom 1954 673651 405 65.0 60.12

Finding the mid-point of each year range

For some rows a range of years is given for a particular maternal mortality rate. We want to find the mid-point of that range. We use this function to find the mid-value of years given.

Code
def get_mid_year(year: str):
    if "-" in str(year):
        year_begin = year.split("-")[0]
        year_end = year.split("-")[1]
        year_end_len = len(year.split("-")[1])
        year_end_pref = 4 - year_end_len
        if year_end_pref > 0:
            year_pref = year_begin[0:year_end_pref]
            year_end = year_pref + year_end
            year_out = int(np.mean([int(year_begin), int(year_end)]))
        else:
            year_out = int(np.mean([int(year_begin), int(year_end)]))
    else:
        year_begin = int(year)
        year_out = int(year)

    return pd.Series([year_begin, year_out])

Let’s check the mid-year estimates have worked as expected. There is one case where the mid-year estimate is not the mid-point of the range of years. Let’s fix this manually.

Code
df_years = df_d["year"].apply(get_mid_year)
df_d["year_begin"] = df_years[0].astype(int)
df_d["mid_year"] = df_years[1].astype(int)

df_d[df_d["year_begin"] > df_d["mid_year"]]

df_d.loc[
    (df_d["year"] == "1897-01") & (df_d["Country"] == "New Zealand"), "mid_year"
] = 1899
assert(df_d[df_d[["Country", "year"]].duplicated(keep=False)].shape[0] == 0)

Rename the columns, keeping only the columns we are interested in and standardise the country names and drop rows with NA values.

Code
df_d = df_d.copy(deep=False)
df_d["year"] = df_d["mid_year"]

df_gap = df_d[["Country", "year", "MMR"]]
df_gap["source"] = "gapminder"
df_gap.rename(
    columns={"Country": "entity", "MMR": "maternal_mortality_rate"}, inplace=True
)

stand_countries = pd.read_csv(
    "data/input/countries_to_standardise_country_standardized.csv"
)
stan_dict = stand_countries.set_index("Country").squeeze().to_dict()
df_gap["entity"] = standardise_entities(df_gap["entity"], stan_dict)
df_gap = df_gap.dropna(subset=['maternal_mortality_rate'])
assert df_gap['entity'].isna().sum() == 0

Read in the OECD Stat data

Add in the OECD Stat Health data - this is not available on the OECD stat API so we manually download it from here: https://stats.oecd.org/index.aspx?queryid=30116

Code
oecd_df = pd.read_csv("data/input/HEALTH_STAT_08082022144439675.csv")
oecd_df = oecd_df[(oecd_df['VAR'] == 'MATIMATM')]
oecd_df = oecd_df[["Country", "Year", "Value"]].rename(
    columns={"Country": "entity", "Year": "year", "Value": "maternal_mortality_rate"}
)
oecd_df["source"] = "oecd"

oecd_df["entity"] = standardise_entities(oecd_df["entity"], stan_dict)
assert oecd_df['entity'].isna().sum() == 0

Read in the latest WHO data

Read in data from WDI - in future iterations we will read this in from ETL so it is auto updated. The WDI sources this variable from the WHO so we will henceforth refer to it as the WHO data.

Code
try:
    r_who = requests.get(
        "https://api.worldbank.org/v2/en/indicator/SH.STA.MMRT?downloadformat=csv"
    )
except requests.exceptions.RequestException as e:
    print(e)
    raise SystemExit(e) 

output = open("data/input/who_mmr.zip", "wb")
output.write(r_who.content)
output.close()
with zipfile.ZipFile("data/input/who_mmr.zip", "r") as zip_ref:
    zip_ref.extractall("data/input/who_mmr")
who = pd.read_csv(
    "data/input/who_mmr/API_SH.STA.MMRT_DS2_en_csv_v2_4252399.csv", skiprows=4
)

Cleaning the latest WHO data

Pivot and clean the WHO data, standardise the country names and drop values that are NA.

Code
year_cols = list(range(1960, 2022))
year_cols = [str(int) for int in year_cols]

who_df = pd.melt(who, id_vars=["Country Name"], value_vars=year_cols)
who_df.rename(
    columns={
        "Country Name": "entity",
        "variable": "year",
        "value": "maternal_mortality_rate",
    },
    inplace=True,
)
who_df["source"] = "who"
who_df["entity"] = standardise_entities(who_df["entity"], stan_dict)
who_df['year'] = who_df['year'].astype(int)
who_df = who_df.dropna(subset=['maternal_mortality_rate'])

Remove entities that we aren’t interested in displaying on grapher as it is difficult to clearly define them.

Code
#%%[markdown]
# ## Cleaning geographic entities
# We don't want to include all of the available regions in the data as it is difficult for us to define them clearly, so we drop out a selection here.
entities_to_drop = [
    "Africa Eastern and Southern",
    "Africa Western and Central",
    "Arab World",
    "Central Europe and the Baltics",
    "Early-demographic dividend",
    "East Asia & Pacific (excluding high income)",
    "East Asia & Pacific (IDA & IBRD countries)",
    "Euro area",
    "Europe & Central Asia (excluding high income)",
    "Europe & Central Asia (IDA & IBRD countries)",
    "European Union",
    "Fragile and conflict affected situations",
    "Heavily indebted poor countries (HIPC)",
    "IBRD only",
    "IDA & IBRD total",
    "IDA blend",
    "IDA only",
    "IDA total",
    "Late-demographic dividend",
    "Latin America & Caribbean (excluding high income)",
    "Latin America & the Caribbean (IDA & IBRD countries)",
    "Least developed countries: UN classification",
    "Middle East & North Africa (excluding high income)",
    "Middle East & North Africa (IDA & IBRD countries)",
    "Not classified",
    "OECD members",
    "Other small states",
    "Pacific island small states",
    "Post-demographic dividend",
    "Pre-demographic dividend",
    "Small states",
    "South Asia (IDA & IBRD)",
    "Sub-Saharan Africa (excluding high income)",
    "Sub-Saharan Africa (IDA & IBRD countries)",
]
who_df = who_df[~who_df["entity"].isin(entities_to_drop)]

Combine the three datasets: Gapminder, OECD Stat and WHO

Code
df = pd.concat([df_gap, who_df, oecd_df], ignore_index=True)
df["year"] = df["year"].astype(int)
df['maternal_mortality_rate'] = df['maternal_mortality_rate'].astype(float)

Identify entities and years which have data from multiple datasets so that we can plot these.

Code
dup_ents = df[['entity', 'year']][df.duplicated(subset=['entity', 'year'])].drop_duplicates()
dup_ents['ent_year'] = dup_ents['entity'] + '_' + dup_ents['year'].astype(str)

gap_ents = df['entity'][df['source'] == 'gapminder'].drop_duplicates().to_list()
oecd_ents = df['entity'][df['source'] == 'oecd'].drop_duplicates().to_list()
Code
df['ent_year'] = df['entity'] + '_' + df['year'].astype(str)

df = df[~((df['ent_year'].isin(dup_ents['ent_year'])) & (df['source'] == 'who'))]

Plot the countries which have data from Gapminder so we can inspect the full timeline. Where there are duplicate values sourced from WHO we drop these.

The data is shown from from 1950 onwards. WHO is shown in blue, Gapminder in green and OECD in red.

Code
df_dup = df[(df.entity.isin(gap_ents)) & (df.year >= 1950)]

plt.figure(figsize=(12,7), facecolor="white")

# plot numbering starts at 1, not 0
plot_number = 1
for countryname, selection in df_dup.groupby("entity"):
    # Inside of an image that's a 15x13 grid, put this
    # graph in the in the plot_number slot.
    ax = plt.subplot(4, 4, plot_number)
    selection_gap = selection[selection["source"] == "gapminder"]
    selection_who = selection[selection["source"] == "who"]
    selection_oecd = selection[selection["source"] == "oecd"]
    if selection_who.shape[0] > 0:
        selection_who.plot(
            x="year",
            y="maternal_mortality_rate",
            color = 'b',
            ax=ax,
            label=countryname,
            legend=False,
        )
    if selection_gap.shape[0] > 0:
        selection_gap.plot(
            x="year",
            y="maternal_mortality_rate",
            color = 'g',
            ax=ax,
            label=countryname,
            legend=False,
        )
    if selection_oecd.shape[0] > 0:
        selection_oecd.plot(
            x="year",
            y="maternal_mortality_rate",
            color = 'r',
            ax=ax,
            label=countryname,
            legend=False,
        )
    ax.set_title(countryname)
    # Go to the next plot for the next loop
    plot_number = plot_number + 1
plt.tight_layout()

The jump in the Sri Lanka and Malaysia trends are odd - we will drop the Gapminder data for this entity.

Code
df = df[~((df['entity'].isin(['Sri Lanka', 'Malaysia'])) & (df['source'] == 'gapminder'))]

Plot the countries which have data from OECD so we can inspect the full timeline. Where there are duplicate values sourced from WHO we drop these.

The data is shown from from 1950 onwards. WHO is shown in blue, Gapminder in green and OECD in red.

Code
df_dup = df[(df.entity.isin(oecd_ents)) & (df.year >= 1950)]

plt.figure(figsize=(12,7), facecolor="white")

# plot numbering starts at 1, not 0
plot_number = 1
for countryname, selection in df_dup.groupby("entity"):
    # Inside of an image that's a 15x13 grid, put this
    # graph in the in the plot_number slot.
    ax = plt.subplot(6, 7, plot_number)
    selection_gap = selection[selection["source"] == "gapminder"]
    selection_who = selection[selection["source"] == "who"]
    selection_oecd = selection[selection["source"] == "oecd"]
    if selection_who.shape[0] > 0:
        selection_who.plot(
            x="year",
            y="maternal_mortality_rate",
            color = 'b',
            ax=ax,
            label=countryname,
            legend=False,
        )
    if selection_gap.shape[0] > 0:
        selection_gap.plot(
            x="year",
            y="maternal_mortality_rate",
            color = 'g',
            ax=ax,
            label=countryname,
            legend=False,
        )
    if selection_oecd.shape[0] > 0:
        selection_oecd.plot(
            x="year",
            y="maternal_mortality_rate",
            color = 'r',
            ax=ax,
            label=countryname,
            legend=False,
        )
    ax.set_title(countryname)
    # Go to the next plot for the next loop
    plot_number = plot_number + 1
plt.tight_layout()

Turkey only has data every 8/9 years with the OECD but it’s not appropriate to fill in the gaps with the WHO data so we will drop the WHO data for Turkey.

Code
df = df[~((df['entity'].isin(['Turkey'])) & (df['source'] == 'who'))]

We’ll keep the full extent of the Gapminder data where possible as this will ensure the smoothest time-series.

Code
no_dups_df = df[~df.duplicated(subset=['entity', 'year'],keep= False)]
keep_gap = df[(df.duplicated(subset=['entity', 'year'], keep = False)) & (df.source == 'gapminder')]

df_clean = pd.concat([no_dups_df, keep_gap], ignore_index=True)

Check that there are no country-year duplicates in the final dataset and do some checks that the data doesn’t have any unexpected NAs in it.

Code
assert df_clean[(df_clean.duplicated(subset=['entity', 'year']))].shape[0] == 0
assert df_clean['entity'].isna().sum() == 0
assert df_clean['year'].isna().sum() == 0
assert df_clean['maternal_mortality_rate'].isna().sum() == 0

Some final cleaning for the main variable we are interested in.

Code
df_clean['maternal_mortality_rate'] = df_clean['maternal_mortality_rate'].astype(float)
df_clean['maternal_mortality_rate'] = round(df_clean['maternal_mortality_rate'], 2)
df_clean = df_clean[["entity", "year", "maternal_mortality_rate"]]

Add a column for each source, for comparison in grapher.

Code
who_df = who_df.rename(columns={"maternal_mortality_rate": "maternal_mortality_rate_who_2019"}).drop(columns=['source'])
oecd_df = oecd_df.rename(columns={"maternal_mortality_rate": "maternal_mortality_rate_oecd"}).drop(columns=['source'])
df_gap = df_gap.rename(columns={"maternal_mortality_rate": "maternal_mortality_rate_gapminder"}).drop(columns=['source'])
df_clean = df_clean.rename(columns={"maternal_mortality_rate": "maternal_mortality_rate_owid"})

dfs = [df_clean, who_df, oecd_df, df_gap]
df_merged = reduce(lambda left,right: pd. merge(left,right,on=['entity', 'year'],how='outer'), dfs)
df_merged = df_merged[~df_merged['entity'].isin(entities_to_drop)]

df_merged.to_csv("data/output/maternal_mortality_rate.csv", index=False)