OpenCB: an open-source big data platform for genomic data

Earlier this month, our Software Quality Manager, Pablo Marin-Garcia, was in Vienna for the European Society of Human Genetics’ 2022 conference. As well as attending some enlightening academic sessions, he presented a poster on some of the latest developments in OpenCB.

What is OpenCB?

Simply put, OpenCB is open-source software for analysing genomic data, started in 2012 and maintained by a community of developers. Thanks to big data technologies, it has the power to handle the demands of genomic data at scale.

As high-throughput screening technologies have advanced, genomic data has been produced in increasing volumes. Current large-scale clinical genomics studies can involve thousands of whole genome sequences, which equate to terabytes of data. Bioinformaticians now face the challenge of storing and analysing this volume of data, which legacy technologies can no longer support.

This requires the creation of software and data management platforms which can handle the big data age of genomics, while delivering on performance, scalability, flexibility, robustness, accuracy and security. Hence, the creation of OpenCB.  

How can OpenCB be applied?  

Because it is an open-source project, the potential of OpenCB really is unlimited. Currently, there are three main projects developing specific functionalities:  

  1. CellBase, a MongoDB database used for querying genomic annotations
  2. OpenCGA, a variant and clinical data store based on MongoDB and HBase with Solr indexing
  3. IVA, a web-based analysis client for OpenCGA

Deployment is based on Docker and Kubernetes under CI/CD. 

Major use cases of OpenCB’s features include population-scale and clinical interpretation analysis for rare diseases and cancer. Because of the power of the software to handle big data, queries and analysis can be done in real time. This work can be conducted under a strict data access authorization model, to maintain privacy compliance.  

You can read more about OpenCB in Zetta’s poster for ESHG 2022 below:  

To speak to a member of our team about the poster or the power of OpenCB, please email

Leanne Elske

Zetta Genomics and the creation of culture

As a new company surges into existence, it is focused on objectives that turn transformational ideas into a successful business. As Zetta Genomics has attracted the investment and partnerships to scale, our People Team is creating a talent structure that will drive innovation, engineering, productisation, marketing, sales and everything in between. We believe, however, that success depends on another and often overlooked structure: culture.

There’s a lot of talk about organisational culture – some of which can seem vague or difficult to quantify early on (which is why it’s so often greeted with an eye-roll). At Zetta, culture isn’t simply finding warm and fuzzy words and putting them on the website. Culture isn’t a box to be ticked. Culture won’t be found on a page in the annual report. 

Culture defines everything that we do at the most basic level, so it must be identified, codified and embedded from the very beginning. It should be understood by, and communicated to, everyone in the team. While a company’s culture might be complex, it must be so intuitive that everyone can own, celebrate and share it.

There are many definitions of organisational culture, but I see it as the formalisation of the shared values that permeate a company, and the shared behaviours that these values inspire. Simply: how the company treats others – and how it expects others to treat the company. 

Everything pivots on a company’s culture. It binds people, policies and products into a single purpose. Because cultures are built together, the people who work there decide whether that purpose is progressive or toxic.

Cultural cost

Toxic cultures come with a high cost. I know because I’ve worked in them. I’m sure we all have. They encourage so many of the unpleasant traits associated with the working world: they are presenteeism-oriented, task-driven, innovation-averse and creatively atrophied. I didn’t work in these environments for long.

People are complex organisms – each with their own unique experiences. Given care, attention and opportunity – always coming from a place of trust – they will amaze and delight you. Treat them badly, however, and they will rightfully walk away.  

In my experience, toxic cultures are far more onerous to maintain than healthy ones. Why tie your People Team up coercing and de-escalating, when it could be nurturing and supporting achievement? Why see Communications devoting untold hours to disingenuous messaging when it could be inspiring, shaping and sharing ideas? Why crack the whip on moribund product and sales teams when they could be self-driven by people who truly believe in what they do? 

In toxic cultures, we very quickly see this turn into a waste of time, money and – most costly of all – talent. 

Cultural impact

Progressive and open cultures have a tangible impact on every aspect of a company’s performance. 

A company that embeds respect, encourages and fosters its people, and listens and acts in response to internal feedback is a company that will naturally do these things externally. These cultures inspire the effective, experience-led and authentic engagements that allow businesses to build trust, enhance products and build lasting customer relationships.

Companies that welcome honesty, don’t apportion blame and give their people the freedom to act – even when this means taking risks – will embed innovation. The confidence to surface wholly new ideas constantly refreshes existing products and pump-primes the development  pipeline.  

Progressive cultures will also have a positive impact on talent acquisition. In a post-Covid and post-Brexit world, talent in the Life Sciences and Tech spaces is at an extreme premium. The people able to harness the power of genomic data are few and in high demand. The best performers can – and do – choose the organisations with the best cultures. Those that think throwing money at cultural shortcomings will solve the problem quickly realise that a big pay cheque won’t compensate for their dysfunctional workplaces. 

Finally, the confidence and creative freedoms that these cultures inspire make for safer organisations. The airline industry, for example, has embraced the idea that knowing when mistakes are made, and their causes, is far more important than blaming individuals for what are actually systemic failures. It means that air travel is the safest mode of transport in the world. It’s an approach that healthcare systems have tried to replicate – with varying degrees of success. The freedom to succeed must also be the freedom to try and fail.

Cultural creation 

Creating a culture takes time and a very real commitment from everyone within the company. While it might seem counter-intuitive, freedom must exist within a framework of systems and processes – transparent to everyone within the organisation – that support and protect it. Wishing a happy and functional culture into existence won’t cut it. Freedom without process isn’t freedom at all; it’s chaos. 

I spoke earlier about a choice between progress and toxicity. Zetta Genomics has very clearly opted for the former. My appointment, and those of Ludo Chapman as our Chair, Bryony Burrows as our CFO and others, are powerful statements of intent. Zetta is investing, at the earliest possible stage, in its people and in its culture.

It’s why I came here. We are a new company with truly transformative genomic data technologies that will play a major role in the adoption of pervasive precision medicine – improving the lives of millions of people across the globe. Our mission matters and so our culture is the very real foundation for our ongoing and sustainable success. 

From one genome to millions: data technology to unleash the potential of precision medicine

In genomics, size really does matter: the larger the dataset, the greater the returns. In the last two decades we have moved from a single genome to datasets of over 100,000. In the next decade datasets of millions will be commonplace. It’s clear, however, that having a big dataset alone is not enough. It is our ability to query, access and interpret these treasure troves to uncover the actionable insight within that will usher in a new era of precision medicine for millions across the world.

One genome to 100,000: scale

To understand where we are in the genomics revolution, we must know where we’ve come from. By 2003, we had mapped the human genome for the first time. It took over a decade, some of the most brilliant minds on the planet and billions of dollars. In the years that followed, the number of genomes grew, giving us a tantalising glimpse of precision medicine’s potential.

By the end of that decade genome sequencing was moving to the clinic, demonstrated in 2008 by the case of Nicholas Volker, the first child whose treatment was based on a diagnosis determined by genome sequencing. Each genome required highly specialised and extremely rare know-how, and often one-off technologies. It was time consuming, costly and delivered limited results because there was little reference information available to drive interpretation. Genomic medicine seemed too complex and expensive to scale.

In 2013, the UK government changed the game – setting up Genomics England to sequence an unprecedented (and many thought impossible) 100,000 genomes within five years. Importantly, the project wasn’t simply about research findings from large numbers of sequenced genomes (although that played a part), but the clinical utility of the data they yielded. This foundational aim was a world-first: a population level genomic medicine service. This wasn’t just a step forward – but rather a gigantic leap.

At the time the 100,000th genome was sequenced in 2018, I was working at Genomics England with my Zetta Genomics co-founder, Ignacio Medina. The goal was to turn the hitherto specialised art of genomic analysis into an industrial process using modern big data technologies. In terms of data management, this essentially meant starting from scratch.

100,000 genomes to 100 million: access, interpret and action

Why start from scratch? Most obviously, the data volumes in play simply overwhelm existing data technologies. But there is a second, more subtle and intriguing reason: precision medicine demands changes to the in vitro diagnostic (IVD) paradigm, moving from a ‘linear assay-to-result workflow’ to an ‘iterative data-driven investigation’.

To explain, while we only need to sequence a person’s genome once, our ability to interpret and gain value from it progresses all the time. The key is to keep going back to the genome as other factors change – something existing ‘flat file’ data management systems fail to deliver. These ‘other factors’ can be categorised in two ways:

1) changes in a person’s medical condition;

2) changes to our understanding of the genome.

As an example of the first factor, in neonatal genomics the condition of an acutely unwell infant can change rapidly and so we need to equally rapidly reinterpret their genome. While the data might yield no results in week one, changes to the condition by week two might reveal a genomic match that leads to a diagnosis with the potential for earlier and more effective interventions. This dynamic, genome-driven approach stands in stark contrast to diagnostic odysseys for rare disease that can take years to resolve.

Turning to the second factor, a patient might present with a condition today and receive no genomic result because we do not yet know the connection. Yet, in a month, a year or a decade, fresh genomic discoveries might provide that diagnosis directly from their existing data without the need for any further clinical procedures.

This is, essentially, what Zetta Genomics’ XetaBase can provide: a genome-optimised, highly automated data management platform that continually interprets the genome based on the latest information. When there is a ‘hit’ following the release of a new gene panel, for example, users are automatically alerted to the potential for a genomic diagnosis. Further, and critically, it makes this information securely but easily accessible to researchers and clinicians where they can action it – whether that’s at the laboratory bench or the patient’s bedside.

XetaBase, Zetta Genomics
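
The reinterpretation loop described above can be illustrated with a minimal sketch. It is not XetaBase or OpenCB code: the `Variant` record, the `find_new_hits` helper and the gene and sample names are all hypothetical, chosen only to show the idea that a genome is sequenced once and re-checked whenever a gene panel is updated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """A stored variant call, reduced to the fields this sketch needs (hypothetical)."""
    sample_id: str
    gene: str
    variant_id: str


def find_new_hits(variants, old_panel, new_panel):
    """Return stored variants that fall in genes added by the new panel version."""
    added_genes = set(new_panel) - set(old_panel)
    return [v for v in variants if v.gene in added_genes]


# Hypothetical stored data: sequenced once, reinterpreted many times.
stored = [
    Variant("patient-1", "GENE_A", "var-001"),
    Variant("patient-1", "GENE_C", "var-002"),
    Variant("patient-2", "GENE_B", "var-003"),
]
panel_v1 = {"GENE_A"}
panel_v2 = {"GENE_A", "GENE_C"}  # the new panel release adds GENE_C

hits = find_new_hits(stored, panel_v1, panel_v2)
for hit in hits:
    # In a real platform this is where a user alert would be raised.
    print(f"Review {hit.sample_id}: {hit.variant_id} in {hit.gene}")
```

Comparing only the genes added between panel versions keeps each re-check proportional to what changed, rather than re-running the whole interpretation; a production system would also need to handle removed genes, changed evidence levels and access control.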

100 million and beyond

I am hugely proud of Zetta Genomics and XetaBase, but not purely for personal or even business reasons. While I am a co-founder alongside Ignacio – who is the architect of the OpenCB platform on which XetaBase is built – Zetta does not belong to us. It is the result of a truly astonishing collaboration that was born out of pioneering work from both Genomics England and the University of Cambridge – and carried forward by the open-source community.

Even now, as we scale up to meet the needs of a rapidly expanding global genomics sector, we move forward with the support of far-sighted investors, commercial partners such as Microsoft, Fujitsu, and Future Perfect Healthcare, and organisations such as the UK’s NHS and its Genomic Medicine Service. Together, we have built a solution that one customer describes as, “completely unlike anything else out there”. They will use it to transform healthcare services – particularly among economically disadvantaged and underserved communities.

Designed to scale to need, XetaBase will continue to unleash the potential of precision medicine for people across the globe – able to harness the power of millions of genomes, or even more.

Zetta Genomics gains £2.5m seed funding to realise the power of genomic data in precision medicine 

Nina Capital, APEX Medical and Cambridge Enterprise invest in transformational open-source data technology – to deliver precision medicine at scale 

University of Cambridge technology spin-out Zetta Genomics announces that it has raised £2.5 million in new seed funding. The successful round secures leading global Venture Capital (VC) investment in the company’s XetaBase platform – a transformational genomic data management technology that powers the discovery and delivery of precision medicine at scale.

Data for the precision medicine era 

Built on the open-source OpenCB platform, co-developed by Zetta Genomics founder, Ignacio Medina, XetaBase allows researchers and clinicians to securely store, easily access and dynamically interrogate vast and increasing volumes of genomic data – on demand. The technology brings genome-enabled discovery, healthcare analytics and population research into the lab – and prognostics, diagnostics and therapeutics into the clinic. 

Zetta Genomics founder, Ignacio Medina said, “Zetta Genomics re-imagines data to deliver a dynamic platform fit for the fast-emerging, fast-scaling, multi-petabyte environment. In liberating genomic data – placing its power into the hands of researchers and clinicians – we will drive precision medicine’s transformation into mainstream healthcare and life-changing patient benefit.” 

Genomic growth  

The seed funding round comes as genome-enabled precision medicine moves from the niche to the mainstream. The UK has led the world with its new Genomic Medicine Service, making testing routine within the publicly funded NHS. These and other population-level initiatives are predicted to see 60 million genomes sequenced by 2025 and 100 million by the end of the decade.

Market growth in the genomics sector is equally impressive, with 2022 research suggesting CAGR will exceed 15% to 2026 and market value will more than double between 2020 and 2026 to US$47 billion. 
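
As a rough consistency check on these figures, the short sketch below compounds a 15% annual rate over six years. It assumes, which the research quoted above does not state, that the 15% rate applies annually across the same 2020 to 2026 window; the function name and the implied 2020 value are illustrative only.

```python
def growth_multiple(cagr, years):
    """Total growth factor after compounding `cagr` annually for `years` years."""
    return (1 + cagr) ** years


# Assumption: the 15% CAGR floor is applied across 2020-2026 (six years).
multiple = growth_multiple(0.15, 2026 - 2020)

# Market value that the US$47 billion 2026 figure would imply for 2020.
implied_2020_value = 47.0 / multiple

print(f"Growth multiple: {multiple:.2f}x")
print(f"Implied 2020 value: about US${implied_2020_value:.1f} billion")
```

Under that assumption the growth multiple is above two, which is consistent with the statement that market value will more than double over the period.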

Zetta Genomics’ expertise, technologies and ongoing growth have attracted investment from Cambridge Enterprise, APEX Ventures (Vienna) and Nina Capital (Barcelona). 

Marc Subirats, Partner at lead investor, Nina Capital, said, “Genomic medicine has enjoyed explosive growth in the past five years, but this is set to be eclipsed in the next decade. XetaBase is an enabling technology – empowering virtually every research field and clinical application. As genomic sequencing moves from the hundreds of thousands to the hundreds of millions, Nina Capital is confident that Zetta Genomics’ growth will both drive and be driven by rapid advances in precision medicine.” 

XetaBase accelerator 

Market and precision medicine opportunities have helped Zetta Genomics to create an extensive, growing and valuable partnership network with organisations such as Fujitsu, Future Perfect Healthcare, Genomics England, Microsoft, the NHS and the University of Cambridge. Microsoft, for example, identified Zetta Genomics as a company of high-value potential and selected it for its Reactor programme, which provides tailored support across its services, including the Azure cloud platform.

Dr Elaine Loukes, Investment Director at University of Cambridge Enterprise, said, “Cambridge Enterprise creates and invests in companies, built on University of Cambridge research, that can have a huge positive impact on society. From our first meeting it was clear that Ignacio had developed something incredibly special. Zetta’s technology helps researchers and clinicians fully exploit genomic data, speeding the delivery of precision medicine across the world.” 

VC-backed funding will focus on growth, enhancing the company’s partnership network while it expands from the UK to open both Spanish and US offices. Investment will also focus on talent, with a five-fold increase in headcount to secure additional software, development and commercialisation expertise. 

Future Perfect and Zetta Genomics win contract to solve data management challenges for the East Genomics Laboratory Hub

Future Perfect Healthcare, a company dedicated to ‘making health and care maximally intelligent’, has partnered with Zetta Genomics [Zetta] to implement its XetaBase genome-optimised data store at the East Genomics Laboratory Hub [East GLH].

The partnership involves the roll-out and ongoing support and maintenance of XetaBase at the lead laboratory of the East GLH, based at Cambridge University Hospitals NHS Foundation Trust. XetaBase uses a big data technology stack based on the Microsoft Azure cloud to solve many of genomic medicine’s data management challenges.

Previous genomic lab processes have been highly time consuming for bioinformaticians.

Future Perfect’s role has been to demonstrate a replicable business case, offering evidenced efficiencies in use of bioinformaticians’ time – essential to rolling out genomic testing.

Jon Reed, chief digital officer at Future Perfect comments, “Genomic medicine is an emerging discipline that uses genome-wide information for diagnostic or therapeutic decision making. It is a discipline in which, through the national Genomic Medicine Service, the NHS is a world-leader. But the volume and complexity of data generated by genomic assays creates significant challenges for clinical implementation. We are so pleased to be working in partnership with Zetta to deliver this project.”

Will Spooner, co-founder and CEO at Zetta Genomics adds, “We’re delighted to be working with our partners at Future Perfect to deploy the XetaBase system at East GLH, which will save the valuable time of bioinformaticians and enable safe and convenient access to the large amounts of complex data they need to conduct the lab’s life-changing research.”

In daily operations, the open-source, open standards XetaBase system will provide the clinical scientists at East GLH with single-point access to patient-level genotypic and phenotypic data sets at a granular level. The data is normalised across patient cohorts, and enriched using the latest clinical knowledge bases. Users are provided with self-service interfaces for real-time genotype-to-phenotype queries that support collaboration between diverse stakeholders across distributed sites. 

The underlying architecture of the system will enable enhanced research capabilities for the team at East GLH, which conducts genomic testing for patients with rare diseases and cancer across the East Midlands and East of England.

Cloud tools and international standard assurance help protect sensitive data and keep genomic medicine pipelines in line with medical device regulations and best practice.

Commenting on the contract awarded to Future Perfect and Zetta, Joo Wook Ahn, principal bioinformatician and the project sponsor at East GLH, concludes, “My team will be able to look efficiently at unparalleled data volumes from the largest genomic tests; panels, exomes, and whole genomes, whether from rare disease or oncology. This is a game-changer for our clinical services, and we are excited to see where these new data capabilities will lead in terms of the life-changing research conducted in the laboratory every day.”

Learn more at

UK Leading On Patient-Centric Precision Medicine Research

A novel system which will allow rare disease patients and their caregivers to add additional information about themselves to research databases is being developed by Sano Genetics in collaboration with Zetta Genomics and Genomics England.

The system will add an important layer of patient-derived information to the groundbreaking precision medicine research being carried out through Genomics England. The information may be reported by participants directly, for example through daily symptom tracking, or by a device such as a watch that measures activity or sleep.

“Many patients and their families are keen to work with researchers to better understand their health conditions. Tools which enable this to happen effectively could lead to exciting new discoveries,” says Jillian Hastings Ward, Chair of the Genomics England Participant Panel.

The project won £450,000 in grant funding from Innovate UK as part of its competition, Digital Health Technology Catalyst Round 4: Collaborative R&D.

This system is a first and further cements Genomics England’s commitment to participant involvement in research.

This new initiative will lay the groundwork for better capturing additional data directly from patients and their families in the ‘real world’ to learn about disease progression and treatment effectiveness from their perspective. Collecting information about health and wellbeing directly from patients can help fill in the blanks between infrequent doctor visits. For example, wearable devices or digital journals that allow parents of children affected by neuromuscular conditions such as Duchenne Muscular Dystrophy to record daily activity would provide much more detailed pictures of disease progression, or improvement after treatment. Development of a patient platform will also enable patients to be notified about new research opportunities that may be relevant for them, including clinical trials that test new medicines.

Sano Genetics, an SME based in Cambridge, UK, has developed a platform for patient engagement in precision medicine research, and is leading the consortium effort to further develop the technology for use in population-scale genomics programmes. Zetta Genomics, an SME also based in Cambridge, brings expertise in big data analysis in genomics using OpenCB, leading open-source software for large-scale genomic data management. Genomics England is a company wholly owned by the UK Department of Health and Social Care that has been at the forefront of patient partnership in precision medicine through the delivery of the 100,000 Genomes Project and the Genomic Medicine Service with the NHS. This collaboration represents another significant step in developing technologies and processes that put the patient voice at the heart of research.

The collaboration involves two main workstreams. The first workstream involves surveys and workshops with patients to influence the features of the platform, and an ethical, regulatory, and legal working group to ensure that any proposed use-cases meet the highest standards in the United Kingdom. Two workshops were held in April with participation from more than twenty participants from the 100,000 Genomes Project and other genetics research initiatives across the UK. A survey of research participants and further workshops will be held in the coming months while the platform is under development. The second workstream is focused on technical developments of the Sano Genetics platform and OpenCB technology, including stress-testing the systems for scalability using simulated data and building capabilities for federated data analysis, whereby data can only be analysed within ‘safe havens’ such as the Genomics England research environment.

The Innovate UK funding will allow the consortium to develop the technology, and go to market in 2021 with a patient engagement platform for population-scale genomics programmes. This collaboration has the potential to accelerate precision medicine by enabling access to real-world data and patient-reported outcomes on a population scale by making patients genuine partners in the research process.

“We are very excited to develop the Sano platform in collaboration with Zetta Genomics and Genomics England, and hope this collaboration can serve as a model to other research programmes and biobanks. This collaboration is initially focused on rare disease, where there is a huge need to develop new treatments and very dedicated patient groups and advocates. We believe our platform can also help accelerate precision medicine research in common genetic conditions and cancer, and are actively setting up collaborations in these areas as well.” – Patrick Short, CEO of Sano Genetics

“Modern digital technologies enable new paradigms for participant engagement in research. The studies enabled by this and similar initiatives are set to drive significant advances in biomedical science. Advances that ultimately translate into the new precision medicines and diagnostics that will improve patient outcomes for generations to come. We are delighted to be part of this ground-breaking project that puts the participant at its heart.” – Will Spooner of Zetta Genomics

“Collaborative efforts such as this are helping us further explore how additional data sources can be used to impact positively on healthcare, while ensuring that the highest levels of data security and integrity are met along the way.” – Augusto Rendon, Chief Bioinformatician, Genomics England