How Prognos is Using COVID-19 Data

Women In Data's NYC chapter held a virtual event centered on how COVID-19 data is being used amid the global pandemic. Kait Arnold, an engineer at VSP Global, and I spoke at this event on May 20, 2020. Kait's focus was data visualization, while I spoke about how clinical laboratory data is analyzed using transformation techniques. The purpose of the event was to educate the audience on how companies are working with coronavirus data to better understand the virus and its spread. The goal of every Women In Data event is to provide a welcoming environment for women to network, learn, and grow in their careers. So in giving this presentation, I wanted not only to educate on the topic but also to give the audience the knowledge needed to do their own COVID analytics, or to inspire them to break into a data-related career. Even just one person reaching out to me wanting to know more about the data world, the healthcare domain, and/or programming is a big win for me!

The event was structured well and flowed nicely: I introduced what laboratory data is and how to ingest and transform it, and then Kait covered how to present that type of data effectively.

I started by introducing the company I work for, Prognos Health, in order to provide the audience with more context on why and how clinical laboratory data analytics are done. Essentially, Prognos offers data deliverables that are tailored to the client's needs. We can take in messy clinical laboratory data and use sophisticated algorithms and data cleaning techniques to send out that data in a way that is easy to interpret. 

What is clinical laboratory data, though? If you have had medical testing performed, such as a blood test or biopsy, you may have noticed that you receive a report with your results. This is what I am referring to when I talk about clinical laboratory data. You may have also noticed that the report you receive differs among labs. Not only do the reports look different, but the way the data is stored electronically by each laboratory varies as well. The file formats, file layouts, and the way results are reported all differ from one lab to another. For example, one lab may send us data in CSV format with columns A-Z and Hemoglobin reported in g/dL. Another may send us those same results for the same patient in HL7 format with columns A-E and Hemoglobin reported in mmol/L. We need to make sure we ingest all of the information correctly no matter what the file format or layout is, and make unit conversions so our data is consistent across the board.

On top of that, the data needs to be cleaned, the ICD codes need to be standardized, missing data needs to be added if possible, and results may need to be converted from numbers to strings, to name just a few steps. A lot of people cannot work effectively with laboratory data because, as you can see, it is very convoluted, and processing it on a large scale requires a lot of resources.
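To make the unit conversion step concrete, here is a minimal Python sketch of how two hemoglobin results reported in different units could be brought onto one scale. It is an illustration only, not Prognos's actual pipeline: the `LabResult` type, the `normalize_hemoglobin` function, and the sample values are all hypothetical, and the conversion factor assumes the hemoglobin monomer molar mass (about 16,114.5 g/mol), so 1 mmol/L is roughly 1.61 g/dL.

```python
from dataclasses import dataclass

# Assumption: mmol/L is expressed per hemoglobin monomer (~16,114.5 g/mol),
# which gives 1 mmol/L ≈ 1.61145 g/dL.
MMOL_PER_L_TO_G_PER_DL = 1.61145


@dataclass
class LabResult:
    """Hypothetical, simplified record for a single lab result."""
    patient_id: str
    test_name: str
    value: float
    unit: str


def normalize_hemoglobin(result: LabResult) -> LabResult:
    """Return a copy of the result with the value expressed in g/dL."""
    unit = result.unit.strip().lower()
    if unit == "g/dl":
        value = result.value
    elif unit == "mmol/l":
        value = result.value * MMOL_PER_L_TO_G_PER_DL
    else:
        # Fail loudly instead of silently passing through an unknown unit.
        raise ValueError(f"Unsupported hemoglobin unit: {result.unit!r}")
    return LabResult(result.patient_id, result.test_name, round(value, 2), "g/dL")


# Made-up example: the same patient reported by two labs in different units.
from_lab_a = LabResult("p001", "Hemoglobin", 14.0, "g/dL")
from_lab_b = LabResult("p001", "Hemoglobin", 8.69, "mmol/L")

for raw in (from_lab_a, from_lab_b):
    print(normalize_hemoglobin(raw))
```

Raising an error on an unrecognized unit is a deliberate choice in this sketch: with clinical data, a silently unconverted value is usually worse than a rejected record that someone has to review.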

To complicate things further, the transfer of healthcare data involves many laws and regulations due to the sensitive patient information it contains. Files must be sent from the labs using a secure file transfer method that is HIPAA compliant, and the patient data is de-identified and often encrypted. You can view a comprehensive overview of the HIPAA laws here. You can see that the rules and regulations outlined by HIPAA are very detailed and complicated. This complexity is another factor that may deter companies from analyzing healthcare data, because if they don't follow the laws entirely, there are harsh repercussions. 

When it comes to coronavirus data, the processes are the same as outlined above. However, our goal is to filter the data down to patients we have identified as having had a COVID-19 test, and then look at their clinical history to see if they have any risk factors. Risk factors are defined as conditions that may predispose one to a more negative health outcome when contracting a disease. The CDC has come up with a comprehensive list of risk factors for COVID-19. We look back at risk factors for all patients who were tested, regardless of whether or not they tested positive. All of our in-house data is sourced from laboratories, so by doing analytics on historical lab data for each patient, we can determine what risk factors a patient has. The end product is a report listing the patients who were tested, their test result (positive or negative for COVID-19), and all of their existing risk factors.
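The filter-then-look-back logic can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the method Prognos uses: the record layout, the `build_covid_report` function, and the tiny ICD-10 prefix map (E11 for type 2 diabetes, I10 for hypertension) are hypothetical stand-ins, and the real CDC list is far longer.

```python
from collections import defaultdict

# Hypothetical, deliberately tiny map from ICD-10 code prefixes to risk factors.
RISK_FACTOR_BY_ICD_PREFIX = {
    "E11": "type 2 diabetes",
    "I10": "hypertension",
}


def build_covid_report(records):
    """Keep only patients with a COVID-19 test and attach their risk factors.

    Each record is assumed to be a dict with 'patient_id', 'test', 'result',
    and an optional 'icd_codes' list. If a patient has several COVID-19
    tests, the last one seen wins in this sketch.
    """
    covid_result = {}
    risk_factors = defaultdict(set)

    for rec in records:
        pid = rec["patient_id"]
        if rec["test"] == "COVID-19":
            covid_result[pid] = rec["result"]
        # Scan every historical record, not just the COVID-19 test itself.
        for code in rec.get("icd_codes", []):
            for prefix, label in RISK_FACTOR_BY_ICD_PREFIX.items():
                if code.startswith(prefix):
                    risk_factors[pid].add(label)

    # Report every tested patient, whether positive or negative.
    return [
        {
            "patient_id": pid,
            "covid_result": result,
            "risk_factors": sorted(risk_factors[pid]),
        }
        for pid, result in sorted(covid_result.items())
    ]


# Made-up sample history for three patients.
history = [
    {"patient_id": "p1", "test": "HbA1c", "result": "7.2", "icd_codes": ["E11.9"]},
    {"patient_id": "p1", "test": "COVID-19", "result": "positive"},
    {"patient_id": "p2", "test": "COVID-19", "result": "negative"},
    {"patient_id": "p3", "test": "Lipid panel", "result": "normal", "icd_codes": ["I10"]},
]

for row in build_covid_report(history):
    print(row)
```

Patient `p3` has a risk factor but never had a COVID-19 test, so they are left out of the report, while `p2` appears with an empty risk factor list.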

With this report, the client can do a lot with the data. Some examples of questions it could help answer are as follows:
1. How do COVID-19 results vary with age?
2. Which geographical areas are testing the most?
3. Which co-morbidities result in the highest death rates from COVID-19?

Some of these questions can be answered solely with the report Prognos generates, while others may require additional resources. For example, Prognos may not have information on who died of COVID-19, so question number 3 would require the client to map the data from Prognos to another source. Regardless, the client is answering important questions and making great progress in their goal of understanding COVID-19 data trends. 
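As a rough idea of how a client might approach the first question, the Python sketch below computes the share of positive results per age band. It assumes the report rows carry an `age` field alongside the test result, which may not match the actual deliverable; the `positivity_by_age_band` function and the sample rows are made up for illustration.

```python
from collections import Counter


def positivity_by_age_band(rows, band_width=20):
    """Return {age band: fraction of tested patients who were positive}."""
    tested = Counter()
    positive = Counter()
    for row in rows:
        # Bucket ages into fixed-width bands such as 20-39, 40-59, ...
        start = (row["age"] // band_width) * band_width
        band = f"{start}-{start + band_width - 1}"
        tested[band] += 1
        if row["covid_result"] == "positive":
            positive[band] += 1
    return {band: positive[band] / tested[band] for band in tested}


# Made-up report rows; 'age' is an assumed field.
report_rows = [
    {"patient_id": "p1", "age": 25, "covid_result": "positive"},
    {"patient_id": "p2", "age": 33, "covid_result": "negative"},
    {"patient_id": "p3", "age": 67, "covid_result": "positive"},
    {"patient_id": "p4", "age": 71, "covid_result": "positive"},
    {"patient_id": "p5", "age": 45, "covid_result": "negative"},
]

for band, rate in sorted(positivity_by_age_band(report_rows).items()):
    print(f"{band}: {rate:.0%} positive")
```

With real data, each band would also need enough tested patients for the rate to mean anything, so reporting the counts next to the percentages is a sensible addition.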

There is no doubt that coronavirus data is playing a huge role in slowing the spread of the disease, as it is driving government decisions on what precautionary steps must be taken and when to re-open states. If more people can use this pandemic to open their eyes to the value of healthcare data, maybe we can make progress towards cleaner, more standardized electronic processes. The more people we have using this data, the better prepared we are to improve lives in general and prevent deaths in a future health crisis.