Maternal Mortality Ratio

Downloading the Gapminder data

Firstly we download the data from Gapminder and save it locally. Then we read in the Gapminder data, clean up the column names and take a look at the data.

Code

r = requests.get("https://www.gapminder.org/documentation/documentation/gapdata010.xls")
assert r.ok
Path("data/input/").mkdir(parents=True, exist_ok=True)
output = open("data/input/gapminder.xls", "wb")
output.write(r.content)
output.close()

Some preliminary cleaning of the data: 1) remove the first 16 rows, 2) remove white space from the country names and 3) remove rows where the maternal mortality rate is NA.

Code

df = pd.read_excel("data/input/gapminder.xls")
df_d = df.drop(df.index[0:16]).reset_index()
df_d["Country"] = df_d["Country"].str.strip()
df_d = df_d[~pd.isna(df_d["MMR"])]

Cleaning the Gapminder data

Important

There are some issues with the Gapminder data - we will manually clean them here.

The first three rows in the ‘year’ column seem to be wrong as it seems as if each row covers 110 years, but in another source they cover only 10 years.
I will replace them manually here with the middle of each decade, as is used in the other data from the same source.
There are cases where the year for Finland has been entered incorrectly three times, there are two 1772s, 1775s & 1967s, the second of each should be 1872, 1875 and 1957 respectively.
There are also two errors for New Zealand, and one error for both Sweden and the US.
Ireland, Denmark and Sri Lanka’s maternal mortality rate drops to zero at some points - we will remove these.

Code

df_d.loc[:2, "year"] = ["1875", "1885", "1895"]

df_d.loc[
    (df_d["year"] == 1772)
    & (df_d["Country"] == "Finland")
    & (df_d["Maternal deaths"] == 487),
    "year",
] = 1872
df_d.loc[
    (df_d["year"] == 1775)
    & (df_d["Country"] == "Finland")
    & (df_d["Maternal deaths"] == 629),
    "year",
] = 1875
df_d.loc[
    (df_d["year"] == 1967)
    & (df_d["Country"] == "Finland")
    & (df_d["Maternal deaths"] == 77),
    "year",
] = 1957
df_d.loc[
    (df_d["year"] == 1967)
    & (df_d["Country"] == "Sweden")
    & (df_d["Maternal deaths"] == 39),
    "year",
] = 1957
df_d.loc[
    (df_d["year"] == 1967)
    & (df_d["Country"] == "United States")
    & (df_d["Maternal deaths"] == 1766.28),
    "year",
] = 1957

# Drop any rows where MMR is zero or '...' (Ireland, Denmark and Sri Lanka)
df_d = df_d.drop(df_d[(df_d["MMR"] == 0) | (df_d["MMR"] == '...')].index)

df_d = df_d.drop(
    df_d[
        (df_d["Country"] == "New Zealand")
        & (df_d["year"] == 1950)
        & (df_d["MMR"] == 90)
    ].index
)


# Duplicates after get_mid_year()
df_d = df_d.drop(
    df_d[
        (df_d["year"].isin(["1989-02", "1899-03"])) & (df_d["Country"] == "New Zealand")
    ].index
)

Check that there are no duplicates in the data.

Code

assert (
    df_d[df_d.duplicated(subset = ["Country", "year"],keep=False)].shape[0]
    == 0
)

Checking the given MMR matches a calculated MMR, where the data for live births and maternal deaths is given. It seems largely okay but the values for Ireland, particularly from 1901-1920. There is a drop of almost 1000 maternal deaths between 1920 and 1922.

In the registrar general of Ireland it states that there were approximately 599 maternal deaths in Ireland in 1912 and on average 634 per year between 1903 and 1911. The estimated rate is 6 deaths per 1,000 births, so 600 per 100,000 births.

I think this suggests that the ‘maternal deaths’ figures may be wrong for the period 1901-1920. I think they may have back-calculated the maternal deaths from the MMR but accidentally used the year column instead as the calculated MMR works out as the year value.

Additionally there are some rows for the UK which don’t quite match, but they are very close so we will continue with the original values.

Code

df_check = df_d[['Country','year','Live Births', 'Maternal deaths', 'MMR']].dropna().reset_index(drop=True)
df_check['mmr_calc'] = ((df_check['Maternal deaths']/df_check['Live Births'])* 100000).astype(float).round(2)
df_check['MMR'] = df_check['MMR'].astype(float).round(2)


df_check[ df_check['mmr_calc'].round(2)!= df_check['MMR'].round(2)]

	Country	year	Live Births	Maternal deaths	MMR	mmr_calc
597	Ireland	1901	71827.1584	1365.434281	618.0	1901.00
598	Ireland	1902	72580.042	1380.472399	623.0	1902.00
599	Ireland	1903	72616.1667	1381.885652	563.0	1903.00
600	Ireland	1904	74020.0312	1409.341394	562.0	1904.00
601	Ireland	1905	73242.8892	1395.277039	557.0	1905.00
602	Ireland	1906	73539.1845	1401.656857	586.0	1906.00
603	Ireland	1907	72435.2952	1381.341079	496.0	1907.00
604	Ireland	1908	72697.7941	1387.073911	520.0	1908.00
605	Ireland	1909	73043.1	1394.392779	546.0	1909.00
606	Ireland	1910	72697.7941	1388.527867	531.0	1910.00
607	Ireland	1911	72319.7592	1382.030598	505.0	1911.00
608	Ireland	1912	71483.563	1366.765725	543.0	1912.00
609	Ireland	1913	70505.0556	1348.761714	536.0	1913.00
610	Ireland	1914	69693.6314	1333.936105	521.0	1914.00
611	Ireland	1915	66966.746	1282.413186	530.0	1915.00
612	Ireland	1916	63544.0465	1217.503931	574.0	1916.00
613	Ireland	1917	59895.5845	1148.198355	485.0	1917.00
614	Ireland	1918	60298.2468	1156.520374	483.0	1918.00
615	Ireland	1919	61931.92	1188.473545	471.0	1919.00
616	Ireland	1920	70748.4	1358.36928	487.0	1920.00
617	Ireland	1922	58849	370	566.0	628.73
618	Ireland	1923	61690	328	482.0	531.69
619	Ireland	1924	63402	330	478.0	520.49
620	Ireland	1925	62069	312	469.0	502.67
621	Ireland	1926	61176	329	489.0	537.79
622	Ireland	1927	60054	291	451.0	484.56
623	Ireland	1928	59176	318	493.0	537.38
624	Ireland	1929	58280	283	410.0	485.59
625	Ireland	1930	58333	294	477.0	504.00
626	Ireland	1931	57086	272	431.0	476.47
627	Ireland	1932	56240	280	455.0	497.87
628	Ireland	1933	57364	296	444.0	516.00
629	Ireland	1934	57897	304	468.0	525.07
630	Ireland	1935	58266	297	467.0	509.73
631	Ireland	1936	58115	299	470.0	514.50
632	Ireland	1937	56488	237	361.0	419.56
633	Ireland	1938	56925	267	411.0	469.04
634	Ireland	1939	56070	218	339.0	388.80
635	Ireland	1940	56394	227	367.0	402.53
636	Ireland	1941	56780	209	321.0	368.09
637	Ireland	1942	66117	189	247.0	285.86
638	Ireland	1943	64375	162	225.0	251.65
639	Ireland	1944	65425	176	238.0	269.01
640	Ireland	1945	66861	176	237.0	263.23
641	Ireland	1946	67922	163	201.0	239.98
642	Ireland	1947	68978	148	188.0	214.56
643	Ireland	1948	65930	124	158.0	188.08
644	Ireland	1949	64153	129	181.0	201.08
645	Ireland	1950	63565	99	156.0	155.75
1487	United Kingdom	1951	677529	526	76.0	77.64
1488	United Kingdom	1952	673735	463	67.0	68.72
1489	United Kingdom	1953	684372	495	71.0	72.33
1490	United Kingdom	1954	673651	405	65.0	60.12

Finding the mid-point of each year range

For some rows a range of years is given for a particular maternal mortality rate. We want to find the mid-point of that range. We use this function to find the mid-value of years given.

Code

def get_mid_year(year: str):
    if "-" in str(year):
        year_begin = year.split("-")[0]
        year_end = year.split("-")[1]
        year_end_len = len(year.split("-")[1])
        year_end_pref = 4 - year_end_len
        if year_end_pref > 0:
            year_pref = year_begin[0:year_end_pref]
            year_end = year_pref + year_end
            year_out = int(np.mean([int(year_begin), int(year_end)]))
        else:
            year_out = int(np.mean([int(year_begin), int(year_end)]))
    else:
        year_begin = int(year)
        year_out = int(year)

    return pd.Series([year_begin, year_out])

Let’s check the mid-year estimates have worked as expected. There is one case where the mid-year estimate is not the mid-point of the range of years. Let’s fix this manually.

Code

df_years = df_d["year"].apply(get_mid_year)
df_d["year_begin"] = df_years[0].astype(int)
df_d["mid_year"] = df_years[1].astype(int)

df_d[df_d["year_begin"] > df_d["mid_year"]]

df_d.loc[
    (df_d["year"] == "1897-01") & (df_d["Country"] == "New Zealand"), "mid_year"
] = 1899
assert(df_d[df_d[["Country", "year"]].duplicated(keep=False)].shape[0] == 0)

Rename the columns, keeping only the columns we are interested in and standardise the country names and drop rows with NA values.

Code

df_d = df_d.copy(deep=False)
df_d["year"] = df_d["mid_year"]

df_gap = df_d[["Country", "year", "MMR"]]
df_gap["source"] = "gapminder"
df_gap.rename(
    columns={"Country": "entity", "MMR": "maternal_mortality_rate"}, inplace=True
)

stand_countries = pd.read_csv(
    "data/input/countries_to_standardise_country_standardized.csv"
)
stan_dict = stand_countries.set_index("Country").squeeze().to_dict()
df_gap["entity"] = standardise_entities(df_gap["entity"], stan_dict)
df_gap = df_gap.dropna(subset=['maternal_mortality_rate'])
assert df_gap['entity'].isna().sum() == 0

Read in the OECD Stat data

Add in the OECD Stat Health data - this is not available on the OECD stat API so we manually download it from here: https://stats.oecd.org/index.aspx?queryid=30116

Code

oecd_df = pd.read_csv("data/input/HEALTH_STAT_08082022144439675.csv")
oecd_df = oecd_df[(oecd_df['VAR'] == 'MATIMATM')]
oecd_df = oecd_df[["Country", "Year", "Value"]].rename(
    columns={"Country": "entity", "Year": "year", "Value": "maternal_mortality_rate"}
)
oecd_df["source"] = "oecd"

oecd_df["entity"] = standardise_entities(oecd_df["entity"], stan_dict)
assert oecd_df['entity'].isna().sum() == 0

Read in the latest WHO data

Read in data from WDI - in future iterations we will read this in from ETL so it is auto updated. The WDI sources this variable from the WHO so we will henceforth refer to it as the WHO data.

Code

try:
    r_who = requests.get(
        "https://api.worldbank.org/v2/en/indicator/SH.STA.MMRT?downloadformat=csv"
    )
except requests.exceptions.RequestException as e:
    print(e)
    raise SystemExit(e) 

output = open("data/input/who_mmr.zip", "wb")
output.write(r_who.content)
output.close()
with zipfile.ZipFile("data/input/who_mmr.zip", "r") as zip_ref:
    zip_ref.extractall("data/input/who_mmr")
who = pd.read_csv(
    "data/input/who_mmr/API_SH.STA.MMRT_DS2_en_csv_v2_4252399.csv", skiprows=4
)

Cleaning the latest WHO data

Pivot and clean the WHO data, standardise the country names and drop values that are NA.

Code

year_cols = list(range(1960, 2022))
year_cols = [str(int) for int in year_cols]

who_df = pd.melt(who, id_vars=["Country Name"], value_vars=year_cols)
who_df.rename(
    columns={
        "Country Name": "entity",
        "variable": "year",
        "value": "maternal_mortality_rate",
    },
    inplace=True,
)
who_df["source"] = "who"
who_df["entity"] = standardise_entities(who_df["entity"], stan_dict)
who_df['year'] = who_df['year'].astype(int)
who_df = who_df.dropna(subset=['maternal_mortality_rate'])

Remove entities that we aren’t interested in displaying on grapher as it is difficult to clearly define them.

Code

#%%[markdown]
# ## Cleaning geographic entities
# We don't want to include all of the available regions in the data as it is difficult for us to define them clearly, so we drop out a selection here.
entities_to_drop = [
    "Africa Eastern and Southern",
    "Africa Western and Central",
    "Arab World",
    "Central Europe and the Baltics",
    "Early-demographic dividend",
    "East Asia & Pacific (excluding high income)",
    "East Asia & Pacific (IDA & IBRD countries)",
    "Euro area",
    "Europe & Central Asia (excluding high income)",
    "Europe & Central Asia (IDA & IBRD countries)",
    "European Union",
    "Fragile and conflict affected situations",
    "Heavily indebted poor countries (HIPC)",
    "IBRD only",
    "IDA & IBRD total",
    "IDA blend",
    "IDA only",
    "IDA total",
    "Late-demographic dividend",
    "Latin America & Caribbean (excluding high income)",
    "Latin America & the Caribbean (IDA & IBRD countries)",
    "Least developed countries: UN classification",
    "Middle East & North Africa (excluding high income)",
    "Middle East & North Africa (IDA & IBRD countries)",
    "Not classified",
    "OECD members",
    "Other small states",
    "Pacific island small states",
    "Post-demographic dividend",
    "Pre-demographic dividend",
    "Small states",
    "South Asia (IDA & IBRD)",
    "Sub-Saharan Africa (excluding high income)",
    "Sub-Saharan Africa (IDA & IBRD countries)",
]
who_df = who_df[~who_df["entity"].isin(entities_to_drop)]

Combine the three datasets: Gapminder, OECD Stat and WHO

Code

df = pd.concat([df_gap, who_df, oecd_df], ignore_index=True)
df["year"] = df["year"].astype(int)
df['maternal_mortality_rate'] = df['maternal_mortality_rate'].astype(float)

Identify entities and years which have data from multiple datasets so that we can plot these.

Code

dup_ents = df[['entity', 'year']][df.duplicated(subset=['entity', 'year'])].drop_duplicates()
dup_ents['ent_year'] = dup_ents['entity'] + '_' + dup_ents['year'].astype(str)

gap_ents = df['entity'][df['source'] == 'gapminder'].drop_duplicates().to_list()
oecd_ents = df['entity'][df['source'] == 'oecd'].drop_duplicates().to_list()

Code

df['ent_year'] = df['entity'] + '_' + df['year'].astype(str)

df = df[~((df['ent_year'].isin(dup_ents['ent_year'])) & (df['source'] == 'who'))]

Plot the countries which have data from Gapminder so we can inspect the full timeline. Where there are duplicate values sourced from WHO we drop these.

The data is shown from from 1950 onwards. WHO is shown in blue, Gapminder in green and OECD in red.

Code

df_dup = df[(df.entity.isin(gap_ents)) & (df.year >= 1950)]

plt.figure(figsize=(12,7), facecolor="white")

# plot numbering starts at 1, not 0
plot_number = 1
for countryname, selection in df_dup.groupby("entity"):
    # Inside of an image that's a 15x13 grid, put this
    # graph in the in the plot_number slot.
    ax = plt.subplot(4, 4, plot_number)
    selection_gap = selection[selection["source"] == "gapminder"]
    selection_who = selection[selection["source"] == "who"]
    selection_oecd = selection[selection["source"] == "oecd"]
    if selection_who.shape[0] > 0:
        selection_who.plot(
            x="year",
            y="maternal_mortality_rate",
            color = 'b',
            ax=ax,
            label=countryname,
            legend=False,
        )
    if selection_gap.shape[0] > 0:
        selection_gap.plot(
            x="year",
            y="maternal_mortality_rate",
            color = 'g',
            ax=ax,
            label=countryname,
            legend=False,
        )
    if selection_oecd.shape[0] > 0:
        selection_oecd.plot(
            x="year",
            y="maternal_mortality_rate",
            color = 'r',
            ax=ax,
            label=countryname,
            legend=False,
        )
    ax.set_title(countryname)
    # Go to the next plot for the next loop
    plot_number = plot_number + 1
plt.tight_layout()

The jump in the Sri Lanka and Malaysia trends are odd - we will drop the Gapminder data for this entity.

Code

df = df[~((df['entity'].isin(['Sri Lanka', 'Malaysia'])) & (df['source'] == 'gapminder'))]

Plot the countries which have data from OECD so we can inspect the full timeline. Where there are duplicate values sourced from WHO we drop these.

The data is shown from from 1950 onwards. WHO is shown in blue, Gapminder in green and OECD in red.

Code

df_dup = df[(df.entity.isin(oecd_ents)) & (df.year >= 1950)]

plt.figure(figsize=(12,7), facecolor="white")

# plot numbering starts at 1, not 0
plot_number = 1
for countryname, selection in df_dup.groupby("entity"):
    # Inside of an image that's a 15x13 grid, put this
    # graph in the in the plot_number slot.
    ax = plt.subplot(6, 7, plot_number)
    selection_gap = selection[selection["source"] == "gapminder"]
    selection_who = selection[selection["source"] == "who"]
    selection_oecd = selection[selection["source"] == "oecd"]
    if selection_who.shape[0] > 0:
        selection_who.plot(
            x="year",
            y="maternal_mortality_rate",
            color = 'b',
            ax=ax,
            label=countryname,
            legend=False,
        )
    if selection_gap.shape[0] > 0:
        selection_gap.plot(
            x="year",
            y="maternal_mortality_rate",
            color = 'g',
            ax=ax,
            label=countryname,
            legend=False,
        )
    if selection_oecd.shape[0] > 0:
        selection_oecd.plot(
            x="year",
            y="maternal_mortality_rate",
            color = 'r',
            ax=ax,
            label=countryname,
            legend=False,
        )
    ax.set_title(countryname)
    # Go to the next plot for the next loop
    plot_number = plot_number + 1
plt.tight_layout()

Turkey only has data every 8/9 years with the OECD but it’s not appropriate to fill in the gaps with the WHO data so we will drop the WHO data for Turkey.

Code

df = df[~((df['entity'].isin(['Turkey'])) & (df['source'] == 'who'))]

We’ll keep the full extent of the Gapminder data where possible as this will ensure the smoothest time-series.

Code

no_dups_df = df[~df.duplicated(subset=['entity', 'year'],keep= False)]
keep_gap = df[(df.duplicated(subset=['entity', 'year'], keep = False)) & (df.source == 'gapminder')]

df_clean = pd.concat([no_dups_df, keep_gap], ignore_index=True)

Check that there are no country-year duplicates in the final dataset and do some checks that the data doesn’t have any unexpected NAs in it.

Code

assert df_clean[(df_clean.duplicated(subset=['entity', 'year']))].shape[0] == 0
assert df_clean['entity'].isna().sum() == 0
assert df_clean['year'].isna().sum() == 0
assert df_clean['maternal_mortality_rate'].isna().sum() == 0

Some final cleaning for the main variable we are interested in.

Code

df_clean['maternal_mortality_rate'] = df_clean['maternal_mortality_rate'].astype(float)
df_clean['maternal_mortality_rate'] = round(df_clean['maternal_mortality_rate'], 2)
df_clean = df_clean[["entity", "year", "maternal_mortality_rate"]]

Add a column for each source, for comparison in grapher.

Code

who_df = who_df.rename(columns={"maternal_mortality_rate": "maternal_mortality_rate_who_2019"}).drop(columns=['source'])
oecd_df = oecd_df.rename(columns={"maternal_mortality_rate": "maternal_mortality_rate_oecd"}).drop(columns=['source'])
df_gap = df_gap.rename(columns={"maternal_mortality_rate": "maternal_mortality_rate_gapminder"}).drop(columns=['source'])
df_clean = df_clean.rename(columns={"maternal_mortality_rate": "maternal_mortality_rate_owid"})

dfs = [df_clean, who_df, oecd_df, df_gap]
df_merged = reduce(lambda left,right: pd. merge(left,right,on=['entity', 'year'],how='outer'), dfs)
df_merged = df_merged[~df_merged['entity'].isin(entities_to_drop)]

df_merged.to_csv("data/output/maternal_mortality_rate.csv", index=False)