Construction and validation of a covariate-based model for district-level estimation of excess deaths due to COVID-19 in India

Background Different statistical approaches for estimating excess deaths due to coronavirus disease 2019 (COVID-19) pandemic have led to varying estimates. In this study, we developed and validated a covariate-based model (CBM) with imputation for prediction of district-level excess deaths in India. Methods We used data extracted from deaths registered under the Civil Registration System for 2015–19 for 684 of 713 districts in India to estimate expected deaths for 2020 through a negative binomial regression model (NBRM) and to calculate excess observed deaths. Specifically, we used 15 covariates across four domains (state, health system, population, COVID-19) in a zero inflated NBRM to identify covariates significantly (P < 0.05) associated with excess deaths estimate in 460 districts. We then validated this CBM in 140 districts by comparing predicted and estimated excess. For 84 districts with missing covariates, we validated the imputation with CBM by comparing estimated with predicted excess deaths. We imputed covariate data to predict excess deaths for 29 districts which did not have data on deaths. Results The share of elderly and urban population, the under-five mortality rate, prevalence of diabetes, and bed availability were significantly associated with estimated excess deaths and were used for CBM. The mean of the CBM-predicted excess deaths per district (x̄ = 989, standard deviation (SD) = 1588) was not significantly different from the estimated one (x̄ = 1448, SD = 3062) (P = 0.25). The estimated excess deaths (n = 67 540; 95% confidence interval (CI) = 35 431, 99 648) were similar to the predicted excess death (n = 64 570; 95% CI = 54 140, 75 000) by CBM with imputation. The total national estimate of excess deaths for all 713 districts was 794 989 (95% CI = 664 895, 925 082). Conclusions A CBM with imputation can be used to predict excess deaths in an appropriate context.

Many global and national estimates of excess all-cause mortality for COVID-19 have been published using different statistical approaches and models [7,8], each with its own strengths and limitations [9].The Institute of Health Metrics and Evaluation (IHME), using an ensemble of six models, estimated 18.2 million deaths (95% uncertainty interval (UI) = 17.1, 19.6) due to COVID-19 compared to the reported 5.94 million during January 2020 to December 2021 [10].The World Health Organization (WHO), using a Bayesian Poisson framework, estimated 14.83 (UI = 13.23,16.58) million excess deaths globally, 2.74 times more deaths than the 5.42 million reported as due to COVID-19 for 2020 and 2021 [11].These different and non-overlapping estimates raise doubts about their veracity, leading to disputes between researchers, governments, and the general population, as the measure has been interpreted as an indirect indicator of effective management of the pandemic.
Estimating excess deaths is statistically straightforward for populations that have fully functional mortality surveillance systems; the key challenge is doing so in countries with partial or complete lack of data.The WHO report indicated that monthly national data were available for only 100 countries (52%), with other countries having only annual, subnational, or no data, especially among those in Africa and Asia [11].One of the standard approaches in these situations is to extrapolate using appropriately chosen covariates that are associated with excess death estimates at population level or use data from another country believed to have 'similar' characteristics.Both these approaches have limitations and would need validation if the generated estimates were to be credible [12].
In India, the Civil Registration System provides annual district-level data on registered deaths.However, even though death registration is legally mandatory in India, a comparison with the National Family Health Survey data for 2019-21 showed that only 70.8% of deaths were truly registered [13].According to the data obtained through the medical certification of causes of death in 2020 (which covered 22.5% of all registered deaths), 160 618 deaths were attributable to COVID-19 in India [14].Consequently, data availability challenges for estimating COVID-19-related mortality in India have been well documented [15].
The WHO estimated that there were 4.74 million excess deaths due to COVID-19 in India, 10 times more than what was officially reported in India (0.48 million) for 2020-21 [11].Both the IHME and the WHO used subnational data of 12 states for estimation due to issues with data availability, which they then extrapolated to the whole country using covariates [10,11].However, the Indian government contested the WHO estimate on the grounds of availability and access to mortality data and methods of extrapolation.Undoubtedly, there are wide within-country variations in excess deaths depending on the population profile, health system response attributes, and the pandemic severity, which can raise doubts about the extrapolation [16].In fact, making estimates for countries without data are a challenge worldwide.
Many studies [17][18][19][20][21][22][23] reported population-level covariates for predicting excess deaths in a community.Some of the factors that have been found to be significantly associated include urbanicity; population density; proportion of elderly in the population; presence of specific comorbidities; health system and governance indicators like human resource or bed density; coverage of public health interventions; and indicators of severity of the pandemic.A district-level study of nationally surveyed deaths from India showed that levels of poverty, education, public health access, urban location and other correlates of mortality accounted for one-fifth of differences in age-standardised mortality rates at ages 15-69 years in men and for one-third in women [24].In a previous study, we found that civil registration data from one district in India was appropriate for excess death estimation [25].The choice of a district as unit of analysis was due to its size (about 2 million) and availability of most administrative data at this level.
We tested the hypothesis that, in the Indian context, a combination of covariate-based model (CBM) and an imputation approach for the missing covariates can be used to predict excess deaths in districts where mortality data are lacking.

METHODS
For our analysis, we required data from 2015 to 2019 (minimum of three years) in addition to the data for 2020 and covariate data for all the districts of India.Moreover, while India had 741 districts in 2020 (reference year), we used data from 713 districts as per the year 2015 in this analysis.We combined districts that were split subsequently in the analysis.We extracted data on reported deaths from 2015 to 2020 from the Civil Registration System report for the respective years [26].
We identified a set of covariates based on the literature and likely availability of data at district level.The final list had three state level covariates (state health sector performance score, under-five mortality rate, and population above 60 years), as these data were available only at the state level.The district level covariates covered three domains -the population risk profile (percentage of urban population, density of population per km 2 , prevalence (%) of diabetes, percentage of disadvantaged population-proportion of Schedule caste and schedule tribes, percentage of population living below poverty line); health system capacity or performance (percentage of antenatal care coverage, hospital bed availability per 10 000 population, district hospital score); and pandemic severity (reported number of COVID-19 positive cases and deaths for 2020).We used the district population for 2020 and percentage of deaths registered for 2016-18 as an offset for the analysis.Here we estimated the population for 2020 by multiplying the district level decadal growth rate [27] with the population reported in the 2011 census for each district.We extracted other demographic covariates (percentage of urban population, density of population per km 2 , prevalence (%), and proportion of schedule caste and schedule tribes) and the number of hospital beds per 10 000 population from the District Census Handbook 2011 [28].Remaining covariates have been collected for 2015-19.Our sources for the covariates were the Report on Healthy States [29], the National Family Health Survey 5 [30], the Census District Handbook [28], the National Multidimensional Poverty Index [31], the Best Practices in Performance of District Hospitals report [32], and COVID-19 India.org[33] (Table 1).
Based on the availability of data on reported deaths and identified covariates, we divided the districts into four groups, whose characteristics we compared using the χ 2 test for proportions and the t test for mean (Figure 1): • Group A: Minimum three years data available for deaths for 2015-19, death data for 2020 and data on all covariates (n = 460); • Group B: Minimum three years available for deaths for 2015-19, death data for 2020 and complete data for covariates found significant in the covariate model (n = 140); • Group C: Data for excess death estimation with incomplete availability of significant covariates (n = 84).
• Group D: Data on deaths not available for 2015-20 and partially (to be imputed) or completely available data for significant covariates (n = 29).We estimated the expected annual deaths for 2020 for each of the 684 districts (29 did not have data for this analysis) with adequate death data using negative binomial regression model (NBRM) in STATA software, version 17 (StataCorp LLC, College Station, TX, USA), per the following formula: This is a standard approach used by other researchers, where the expected deaths in a given year is predicted using the margin command after the NBRM for each district [34].We calculated excess deaths by subtracting the expected deaths for 2020 (as estimated above) from the actual registered deaths.If this value was negative, excess death was taken as '0'.

Association of covariates with excess deaths
This analysis was restricted to the Group A districts (n = 460).We calculated a correlation matrix and except for strong association between the variables related to pandemic severity, (correlation coefficient (β) = 0.91), others had a β of less than 0.70 (Table S1 in the Online Supplementary Document).Because of non-normality in the outcome and the likelihood of the zeros coming from more than one source, we opted for a zero inflated negative binomial regression (ZINB) model with district population as an offset [35].We analysed the association of a dependent variable with the independent variables in four models separately for each domain.We then included the significant variables (P < 0.2) for the final model, where any variable with P < 0.05 was considered significant.We also considered other models, such as the two-part model (including binary choice and regression models), and zero inflated Poisson regression and regression.We tested the models for fitness per the Akaike information criterion (AIC), which was lowest for the ZINB model (Table S2 in the Online Supplementary Document) which was finally used.

Validation of the covariate-based model
For the Group B districts (n = 140), we predicted the excess deaths by the above CBM and compared its mean and 95% confidence interval (CI) with the previously estimated excess deaths.For Group C Districts (n = 84), we compared the estimated excess deaths with predicted excess deaths after using an imputation method for missing values of significant covariates.If a value of one of the covariates was not available, we imputed its value using a multiple imputation model (20 times imputed data) [36].For Group D districts (n = 29), we predicted excess deaths based on the validated CBM and imputation method for missing covariates.Thus, we had either a calculated or predicted estimates of excess deaths all districts which were totalled to get the national estimate of excess deaths for COVID-19 for 2020.

RESULTS
Out of the total 713 districts included in our study, 684 (96%) had a minimum of three-year data apart from 2020 data and were included for the excess death estimation.Twenty-nine districts did not have ade- quate data on death registration, so the excess deaths due to COVID-19 were predicted rather than estimated (Figure 1).
Overall, 460 districts were included in the ZINB model, which covered 15 covariates (Table 1).The district-level analysis showed that the deaths registered for 2016-18 accounted for 71.6% of the total deaths; about one-fourths were living below the poverty line; and 8.2% of the population was above 60 years of age.The average district had about two million population, 26% of which was urban, with a density of 696 people per km 2 .There were 104 beds per 10 000 population on average, while the average under-five mortality was 37 per 1000 live births.Furthermore, a district reported 15 000 cases of COVID-19 and 200 deaths on average.
We also imputed the missing covariates for 113 districts of groups C and D.. We used 84 of these districts to validate the imputation model.The estimated excess deaths (n = 67 540; 95% CI = 35 431, 99 648) based on available data were similar to the predicted excess deaths (n = 64 570; 95% CI = 54 140, 75 000) generated by CBM after imputation.The mean of the predicted excess deaths per district (x̄ = 768, SD = 547) for these 84 districts was not significantly different from the estimated one (x̄ = 804, SD = 1675) (P > 0.05) (Figure 3).
We then kept group A districts as the primary group and compared it with the other three groups.We found that group A districts were larger and had a lower share of northeastern districts, and also noted differences in share of older population, beds per 10 000 population, and share of urban population (Table 3).
We predicted the excess deaths in the 29 group D districts with imputed missing covariates using the covariate-based model; the total predicted excess deaths were 20 707 (95% CI = 12 865-28 548).Once added to the estimated excess deaths for 684 districts, there was a total of 794 989 (95% CI = 664 895, 925 082) excess deaths nationally for all 713 districts.

DISCUSSION
In this ecological study, we identified covariates that best explained the district-level excess pandemic death estimates for 2020 using data available in the public domain from nearly all of India's districts.Among all the selected covariates, urbanicity, prevalence of diabetes mellitus, and hospital bed availability were significantly associated at the district level, and the share of elderly and under-five mortality rate at the state level.The predicted estimate for selected districts using this model was comparable to the estimated excess deaths in those districts.The districts which had problems with data availability on deaths were significantly different on the variables used in the model.
Urban settings have been found to be significantly associated with both reported and excess deaths in various other studies.For example, a study of COVID-19 mortality in the Russian Federation showed greater excess mortality in urban areas [37].Tamrakar et al. [38] sought to determine district-level correlates of COVID-19 pandemic in India and found significant association of district population density and percent of urban population with COVID-19 infection ratio.This could be because the first wave in India was largely an urban phenomenon and rural areas had been relatively spared.Meanwhile, being elderly and having a comorbidity such as diabetes were found to be significant individual-level determinants of mortality [39].It is thus possible that populations with a higher share of the elderly and a higher prevalence of diabetes may also have higher mortality, as confirmed by our analysis.In a statelevel analysis of secondary data from India, noncommunicable diseases and their risk factors and epidemiological transition index were associated with reported COVID-19 cases and deaths per million population [40].In Iran, provinces with a higher level of obesity, cigarette smoking, hypertension, and diabetes mellitus had significantly higher COVID-19 death rates [41].In a municipal level analysis in Mexico, the prevalence of diabetes and obesity, the human development index, and altitude were associated with higher mortality rate [42].
Intuitively, a higher bed availability in a population denotes better access to inpatient care required for COVID-19, and therefore, a lesser mortality as well.This was seen in a study in the European Union, where the COVID-19 death rate was inversely associated with the number of available general hospitals, physicians, and nurses [43].However, our analysis showed a positive association with excess deaths.This probably is an artefact created due to cases from neighbouring districts coming to others with a better health service facility.Meanwhile, Anastase et al. [17] found no association between the resuscitation care beds and COVID-19 mortality rate in hospitals during the first and second wave.We also did not detect a significant relation between state performance index, which is not a COVID-19-specific indicator and excess deaths.Cribari-Neto [44] used the differences between the predicted and observed mortality rates as a proxy for local government's administrative efficiencies, though these were related to COVID-19-specific interventions.In Sweden, the hospitalisations and deaths due to COVID-19 were positively associated with population density and the proportion of immigrants while the proportion of the population aged >65 years showed a negative association [20].Furthermore, under-five mortality is used as an indicator of socioeconomic development in a developing country, thus reflecting the health care status and quality of life of the population [45]; its significant association with excess deaths in our study is probably an indirect association of health system performance and overall development status of the district.
Environmental factors like air pollution and access to water and sanitation facilities and altitude were also found to be associated with mortality in various studies are environmental factors like air pollution [46,47].We did not include them in our analysis due to lack of data for those variables and our use of other variables which measures the same domain, such as population risk profile and health system capacity/performance.Importantly, the final list of determinants is dependent on data availability and local context.This does raise questions on using CMB projections for populations from which the equation is not derived.However, our study did show that, despite variations in the distribution of covariates, the prediction by the model was robust and valid.Our study was based in India, which is the second largest country globally in terms of population and which encompasses widely varying settings, mirroring the diversity often seen at the global level.An average district in India has about two million people, which is larger than 170 countries in the world.Thus, any lessons from this study would have global relevance.The three major estimates available globally are those of the WHO, the IHME, and the Economist, all of which differed significantly from one another.All three analyses used multiple covariates selected from a larger set for basing their estimation in countries, without all-cause mortality data.For such estimations, one's choice of covariates depends on the scientific plausibility and the coverage of a broad set of domains while keeping their numbers small.While the domains that we have used were identical to the three prior analyses, the actual covariate may vary by context; for example, under the COVID-19-related domains, we included reported COVID-19 deaths and cases as potential covariates, but did not find them to be significant in the final model.This possibly might be because the testing strategy in India was irregular, because there were reporting issues, and because people often moved beyond districts for care or for testing.For any such model, having too many covariates would mean that the related data may also be not available in many countries, which would necessitate imputation, thereby adding one more layer of uncertainty.However, it should be cautioned that these imputations might be based on data which may not hold true during a pandemic.Also, our study clearly demonstrated that districts with better availability of data differ from districts with poor availability, which has implications for extrapolations.
Our study has strengths and limitations.The choice of covariates and the model should be an important consideration.We used a framework with four domains to identify the covariates, though our choice was limited by data availability.We used a ZINB model, as it had the best fit.Other studies have used different models, including beta regression models, while others calculated death rates rather than excess deaths, which need to be adjusted for population size.Our analyses used district-level data from 96% of the districts of India to estimate the excess deaths, with at least three years of pre-COVID-19 data available in public domain to ensure robust estimates.We acknowledge the potential for bias in covariate selection and the implications of using state-level covariates for district-level analysis.The data from the Civil Registration System has its own limitations, including varied coverage over the years and state-level differences in death registrations.Notably, some larger districts attracted patients from neighbouring districts, which was reflected as excess deaths within receiving districts; since we treated less than predicted deaths in sending districts as nil, it is possible that we overestimated the excess deaths in our study.Moreover, the non-availability of some

Figure 1 .
Figure1. District wise data availability and approach to analysis for estimation of excess deaths adopted in this study.

Figure 2 .
Figure 2. Box plot comparing estimated and predicted mean excess deaths per 100 000 population using covariate model for 140 districts.

Figure 3 .
Figure 3. Box plot comparing estimated and predicted mean excess deaths per 100 000 population using Imputation of missing covariates for 84 districts.

Table 2 .
Association of state and district level covariates with the number of excess deaths in a district in 2020 CI -confidence interval, NITI -National Institution for Transforming India, SC/ST -Schedule Caste/Schedule Tribe *Model 1 includes the covariates for which data are available at the state level.†Model 2 includes covariates depicting population risk profile of the districts.‡Model 3 includes covariates for the health system capacity/performance. §Model 4 includes covariates for the COVID-19 pandemic severity.||Combined model only includes those found significant at 0.2 level in their domain level model with offset for the district population.

Table 3 .
Comparison of covariates of districts used for different purposes in the study, presented as x̄ (SD) [51]national estimate of excess deaths for 2020 in our study was 0.79 million.The ratio this estimate and COVID-19 deaths reported by Government of India (n = 160 618) was 4.8, with the difference being explained by both unreported COVID-19 deaths and deaths indirectly attributed to COVID-19.The ratio between estimated excess mortality rate and reported COVID-19 deaths for India by IHME was 2.95 for deaths between March 2020 to May 2021[48].Our estimate is higher than that given by Jha et al. (0.49 million; lower limit: 0.46, upper limit: 0.51)[49], which was based on telephonic household survey of deaths from 1 June 2020 to 31 December 2020.Anand et al.[50]used three methods of estimation and provided the all-cause excess deaths in the range of 1.5 to 3.4 million in wave 1 (from the start of pandemic through March 2021).By comparing the pandemic period to 2019 in 12 Indian states and its extrapolation to the rest of India, Banaji et al.[51]showed a 28% increase in deaths during April 2020 to May 2021 (includes partly the larger second wave).Direct comparisons between these estimates are challenging because of different time periods.Our estimate of 0.79 million excess deaths for 2020 comprises 9.5% of about 8.1 million deaths registered in India in 2020.However, our estimate of 80 excess deaths per 100 000 population, while a substantial increase from the reported, still puts it as much lower than the global average of the cumulative excess death rate as 120 per 100 000[52].