July 20, 2022

Data cabinets in the cloud

Clinical research data is precious. The IAVI DataSpace platform seeks better ways to store, manage, and share this information securely so it can be used to advance medical research.

Michael Dumiak

Not long ago, Anne Kapaata, a researcher at the Uganda Virus Research Institute, published a study on cytokine concentrations among different subtypes of HIV infections along with colleagues at IAVI in Nairobi, Imperial College in London, and Makerere University.

The researchers did their analysis using samples from Protocol C, a long-running epidemiological study that started in 2006 and was conducted by several research organizations in Kenya, Rwanda, Uganda, South Africa, and Zambia, in partnership with IAVI and the U.S. Agency for International Development (USAID). The Protocol C cohort of 600 volunteers with newly diagnosed HIV infection was followed prospectively in an effort that produced a rich dataset: it logged more than 9,500 clinical visits and more than 150,000 derived samples, reagents, and related information up through the time the project closed its data collection efforts in 2020. This has been fruitful for many areas of research: more than 100 research articles now cite Protocol C sampling.

The vast collection of Protocol C data that Kapaata and her colleagues used resides on a searchable dedicated software platform known as IAVI DataSpace, which provided data visualization tools and aggregate data. This open-access portal was first launched by IAVI in 2020 with partners at Imperial College and support from USAID. Researchers could apply individually for further access to analyze pseudonymized participant-level data and samples.

IAVI DataSpace is now set to move to a more wide-ranging, virtual environment that can securely manage large data fields of potentially sensitive information, while complying with privacy regulations. With this move, there will be room to expand and bring in other datasets: both those collected in the past, and others that will become available in the future.

“Think of it like a wardrobe. Or a chest of drawers,” says Olga Leonova, an IAVI bioinformatics project manager in London. “There are lots of drawers.” And because of the willingness of volunteers to offer such valuable items to fill these drawers, researchers will be able to better investigate HIV, coronaviruses, and other infectious pathogens.

Ever since the DataSpace portal launched, work has been underway to set it in what’s called a Trusted Research Environment, a virtual site that will support the DataSpace platform. This digital space has been broadened to support other datasets and platforms as well as virtual workspaces for data analysis. “It is now a fully-fledged research environment,” says project leader Manjinder Sandhu, population health and data scientist at Imperial College London’s School of Public Health.

A trusted research environment, in software terms, is a cloud-based network for data hosting and analysis with specific features addressing the thicket of privacy issues that accompany research analyzing something as personal as a blood sample. The trusted research environment sets up a virtual platform where an organization can control its suite of data assets and who has access to them, Sandhu says. A researcher with access can analyze data within the environment, but the dataset is safe. “I’ll know it is securely hosted, it can never leave, it can never be downloaded or snapshotted,” he says. This is important for compliance with strict privacy regimes like the General Data Protection Regulation (GDPR), the European Union benchmark for data privacy management. The GDPR makes it incumbent on data managers to comply with its complex strictures.

Working with Amazon Web Services, the IAVI DataSpace team laid in mechanisms to trace how, why, and by whom the host data is being used. This means research covenants can be built in and different levels of permissions set for the way data elements are moved for analysis, all while keeping the dataset on one system, with subsets available through more restricted access.

IAVI has population-level Protocol C data available to explore on its site for anyone. Researchers like Kapaata were able to gain access to more granular data within the DataSpace portal that debuted in 2020. But as Protocol C data is moved to the trusted research environment — a job that’s close to finished — it will reside there in a dedicated space alongside other datasets, such as that from Protocol G, another large and fruitful USAID-supported cohort study that was launched in 2006 to identify HIV-specific broadly neutralizing antibodies to inform vaccine research.

Soon the team will also add the dataset from REACT (Real-time Assessment of Community Transmission), the Imperial College London research program that monitors COVID-19 infection data.

Each new dataset that is loaded onto the platform must meet the same stringent privacy standards: all data being opened for analysis must be supported by the proper consent forms and legal agreements set among partner research organizations. That makes it easier for data managers to sort out permissions for the end users wanting to work with the data. It also allows users to take advantage of a virtual workbench for crunching the data directly in the cloud as if it were on a laptop, a new tool built with the help of Amazon Web Services.

This DataSpace virtual environment has evolved into a kind of software template. That template can now be distributed to research institutions for them to manage and control their own datasets. It’s a priority of IAVI DataSpace to get this software to African institutions so that data collected there can be hosted on the continent. Sandhu hopes this will take place in the next year. One of the several IAVI clinical research center partners located in five sub-Saharan African nations will be a hosting institution. Sandhu says it’s a logical step to have local organizations take the lead in the management and access of data generated within Africa.

Creating new ways to access these robust datasets is already making a difference. Kapaata’s cytokine research is part of broader work for a doctorate for which she is analyzing the Ugandan population in the Protocol C cohort. The DataSpace link was key. “It’s not the normal way of doing things,” she says. “It provided me with the data so I didn’t have to go to the lab myself; all I had to do was analyze and write up. We need more of these spaces.”

Michael Dumiak, based in Berlin, reports on global science, public health, and technology.


Read more about IAVI’s landmark epidemiology studies Protocol C and Protocol G.