An Experiment with Open Science in India: Machine Age Tools for Understanding Economic Development

Arjun Kumar, Anshula Mehta, Chhavi Kapoor, Swati Solanki

Incorporating technological and scientific developments to strengthen India’s economy is more pertinent than ever. With data growing in significance for economic growth, diplomacy, and governance, India must advance its technological tools to stay relevant in the international arena. To share ideas on this subject, the Genalpha Data Center, Impact and Policy Research Institute (IMPRI), organized a webinar on “An Experiment with Open Science in India: Machine Age Tools for Understanding Economic Development,” as part of the #DataDiscourses series.

Dr. Arjun Kumar, Director of IMPRI, introduced the speaker of this webinar, Ms. Aditi Bhowmick, Director (India) of the Development Data Lab (DDL), an organization committed to transforming the depth of open data in developing countries. Ms. Bhowmick, a Princeton graduate, is working towards forging partnerships between the government and civil society on behalf of her organization.

The Significance of Open-access Data

Source: IMPRI #WebPolicyTalk

Ms. Aditi Bhowmick began by stating the objective of her presentation: to discuss the best policy approaches to designing, using, and managing large-scale administrative data.

“The COVID-19 pandemic has shown the acute need to have free-flowing, reliable information for better governance,” said Ms. Bhowmick.

A common geographical frame is the single most important aspect to focus on when approaching India’s data ecosystem. Data sets such as the economic census and the population census have limited usability when they are not linked to one another. Open data access has high potential, as DDL has found.

Ms. Bhowmick says that the DDL repository holds a vast amount of data relevant to policymakers at all levels, who could benefit greatly from it if they made use of it.

The Development Data Lab’s current aim is to unlock the benefits that could be reaped from collaborative work involving policymakers, government representatives, social scientists, civil society, and even the private sector. The organization hopes to encourage collaboration through the SHRUG (Socioeconomic High-resolution Rural-Urban Geographic Data Platform for India), which is both an open-access dataset and a research platform.

The premise of open data science is that researchers who create data and publish results share that data on a public platform. Other researchers can then use it to replicate the findings, test them in other contexts, or investigate other phenomena.

‘Water everywhere but not a drop to drink’ is a fitting way to describe India’s data ecosystem, Ms. Bhowmick believes.

A lot of excellent research work is conducted but remains confined to silos, as Indian social scientists do not have access to the resources or investment needed to make it public or to invite collaborators. Institutional incentives to aid this process are also lacking.

Government departments also produce vast amounts of data, but this too remains confined to silos. These data sets do not interact with each other, which reduces their usability. Because keeping data private takes less effort than publishing it, the data that does become public often lacks supporting documentation and follow-up work, which limits its use. DDL believes that data must be a non-rival public good for maximum benefit.

Currently, the SHRUG is the largest open-access socioeconomic geocoded data set in the developing world. It covers over 500,000 towns and villages and spans a wide variety of socioeconomic, industrial, agricultural, and political data. The unique selling point of the SHRUG is its linkability, both across data sets and over time. This produces a rich understanding of socioeconomic phenomena that could greatly benefit social scientists and policymakers.

“Having open access, high-quality, immediately usable data is very useful for accountability to both the government and citizens to study the impact of large-scale government schemes,” Ms. Bhowmick states.

A very important contribution of village-level data of this kind is that it makes the targeting of government welfare programs much more effective. Ready-to-use data offers an unmatched advantage to journalists, policymakers, social scientists, and others.

One pertinent use case of the SHRUG was the evaluation of the Pradhan Mantri Gram Sadak Yojana (PMGSY), under which 100,000 new roads were built in villages across India over the fifteen years from 2000 to 2015. To study the impact of this scheme on local economies, one would need to link data sets from the population census, the economic census, the socio-economic caste census, satellite imagery, and administrative data produced by the PMGSY.

Without the SHRUG, the lack of a common geographic framework linking all of these data sets would have been a barrier. Because the SHRUG fills that gap, the researchers studying this scheme were able to answer policy-relevant questions in their paper, which was published in the American Economic Review in 2020. Their study found that construction of the new roads did not affect consumption, local entrepreneurship, or agricultural productivity, but it did help people find new jobs outside their villages.
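To make the idea of a common geographic frame concrete, the following is a minimal sketch in Python. It is not DDL's code and does not reflect the SHRUG's actual schema: the identifier format (`loc_001`), the field names, and every value are hypothetical and invented purely for illustration. It shows only that several sources can be combined once they share one location key.

```python
# Illustrative only: toy records keyed by a hypothetical shared location ID.
# None of these values come from the SHRUG or from any real census.
population_census = {
    "loc_001": {"population": 1200},
    "loc_002": {"population": 850},
    "loc_003": {"population": 430},
}
economic_census = {
    "loc_001": {"firms": 14},
    "loc_002": {"firms": 6},
}
road_programme = {
    "loc_001": {"road_built": True},
    "loc_003": {"road_built": False},
}


def link_on_location(*sources):
    """Inner-join any number of dicts on the location keys they all share."""
    if not sources:
        return {}
    shared = set(sources[0])
    for src in sources[1:]:
        shared &= set(src)
    linked = {}
    for loc in sorted(shared):
        row = {}
        for src in sources:
            row.update(src[loc])  # merge this source's fields into the row
        linked[loc] = row
    return linked


linked = link_on_location(population_census, economic_census, road_programme)
print(linked)
```

Only locations present in every source survive the join, which is one way to see why data sets that do not share identifiers have limited usability.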

The SHRUG also provides tools to improve access to healthcare facilities. Civil servants tasked with identifying Primary Healthcare Centres that require the most help would be able to identify districts that are in need using dataset maps that provide granular data.

The COVID-19 pandemic has generated a huge amount of much-needed data. Although plentiful, this data is not adequately utilized: government officials do not invest the time or resources to harvest it, and the different research and policy teams that use or produce it lack the resources to combine it, so it becomes redundant, Ms. Bhowmick concluded.

Developing Analytical Data for the Indian Scenario

The first question in the Q&A session, initiated by Dr. Kumar, was whether documentation exists on how the geographic indicators were created for the SHRUG dataset. Ms. Bhowmick answered that it does exist and was published in the World Bank Economic Review. The second question concerned the kind of ground-level study that the process required and whether the team faced cooperation issues.

To this, Ms. Bhowmick answered that their process required no reinvention, as the Indian government has already collected much of the data that was required. Owing to this, they did not need to collect data on their own. However, the data needed to be validated across data sets and reconfigured to be ready to use. “The Development Data Lab is in conversation with several state government departments and policymakers to figure out how we can use open-data products that we have created to democratise data, and if it is possible to use this to empower representatives at the village level,” Ms. Aditi Bhowmick stated.

Sharing her opinion on whether a central data agency should be set up, Ms. Bhowmick said she believed that the Ministry of Statistics of the Indian Government already does a good job. While capacity issues do exist due to the scale of the governance challenges in India, steps have been taken in the right direction, such as the implementation of the National Data Analytics Platform, which could be a great resource to have access to.

Dr. Kumar asked Ms. Bhowmick what skills data enthusiasts must acquire to apply analytical data, and how discrepancies in data credibility can be avoided. Ms. Bhowmick answered that applying data must be question-based. Being driven by the goal of the project for which the analysis is being undertaken is crucial, and viewing data empirically is a skill required to develop quality data.

For the Indian context, introducing technologically advanced tools without having the capacity on the ground to manage them may be a mistake, and relying on traditional surveying methods is still the way to go, Ms. Bhowmick believes. While incremental improvement is happening, we are many years away from having completely digitized data records, as the COVID-19 pandemic has shown. A great deal of existing data still has not been leveraged, and this material needs attention before thinking about new ways of collecting data and digitizing everything using MIS, Ms. Bhowmick says.

Sharing her thoughts on the limitations that a heterogeneous country like India poses for spatial and visualization exercises, Ms. Bhowmick explained that the frequent changing of geographical units is an obstacle for time-series analyses. The names of villages, and the districts they fall under, are subject to change. While some tools exist to handle Indian phonetics, they are not easily accessible, which creates challenges for data mapping. The Centre’s Local Government Directory, a tool for updating village names and their spellings to ensure uniformity, is a step in the right direction.
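The name-matching problem described above can be sketched with Python's standard library. This is an illustrative stand-in only: `difflib` compares spellings character by character and is not a phonetic matcher, nor is it one of the tools referred to in the talk. The function names, the 0.8 cutoff, and the sample place names are all assumptions chosen for the example.

```python
import difflib


def normalise(name):
    """Lower-case and collapse whitespace so trivial differences do not matter."""
    return " ".join(name.lower().split())


def match_names(old_names, new_names, cutoff=0.8):
    """Map each old spelling to its closest new spelling.

    Returns None for a name when no candidate reaches the similarity cutoff.
    """
    candidates = {normalise(n): n for n in new_names}
    mapping = {}
    for old in old_names:
        hits = difflib.get_close_matches(
            normalise(old), list(candidates), n=1, cutoff=cutoff
        )
        mapping[old] = candidates[hits[0]] if hits else None
    return mapping


# Hypothetical spellings from two survey rounds, invented for illustration.
round_one = ["Rampur Khas", "Kishanganj", "Navagaon"]
round_two = ["Rampur Khaas", "Kishangunj", "Sitapur"]
matches = match_names(round_one, round_two)
print(matches)
```

Names left unmatched (mapped to `None`) would need manual review or an authoritative directory, which is where a maintained reference list of official spellings becomes useful.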

Regarding the work being done with NITI Aayog’s National Data Analytics Platform and its future trajectory, Ms. Bhowmick stated that things seem to be advancing very fast. Governmental data platforms are a step in the right direction and will make it much easier to form public-private partnerships that make data transparent, develop incentives, and create positive externalities. Investors must be brought on board to improve data infrastructure in the country.

Asked about the role of blockchain and AI in enabling data analytics and where machine learning will be of use in the development sector, Ms. Bhowmick said that machine learning is already being used in developed countries to solve public policy challenges. Examples include allocating staff to government departments, estimating need in real time, and applying machine learning algorithms in the criminal justice system.

While this is an excellent development, we should also be very cautious of biases that could be built into automated programs. Data skills in India are very rich, but they must be harnessed properly by marrying them with social justice and a deep socioeconomic understanding, Ms. Bhowmick concluded.

Dr. Arjun Kumar wrapped up the session by thanking Ms. Aditi Bhowmick for a lively and informative discussion.

Acknowledgment: Malcolm Antony is a research intern at IMPRI and is pursuing a BA (Hons.) in Political Science at Delhi University.
