Unstructured Data
Transform Rich Clinical Notes Into Integrated Insights
- Physician Sentiment and Prescribing Behavior: Analyze HCP ordering patterns and prescribing decisions, including the sentiment behind treatment choices, to better understand provider preferences and clinical rationale.
- Comprehensive Disease Profiling: Go beyond diagnoses to track disease progression using clinical data such as tumor stage, size, and histology for more accurate patient segmentation.
- Detailed Treatment History Insights: Uncover critical treatment details including dose frequency, therapy duration, and reasons for switching to inform clinical strategy and market access planning.
- Holistic Patient Journey Analysis: Correlate treatment patterns with HCP behavior, sentiment, payer coverage, site-of-care trends, and other factors to reveal real-world insights.
Transform EMR Data into Smarter Decisions
What Our Clients Are Saying
"For rare diseases, traditional labs and claims data alone haven’t given us the depth we need to act. Norstella’s EMR data is changing that. The level of clarity is redefining how we go to market."
Director, Commercial and Medical IT, Rare Disease
Global Biotech Company
How We Can Help
We unlock the insights hidden in clinical notes and EMR data to reveal the full patient story. This enables your team to act with greater clarity, precision, and confidence.
Unstructured Data unlocks insights hidden in clinical notes by transforming free-text EMR information into structured, clinically validated intelligence that reveals meaningful patterns traditional claims and structured data often miss.
Unstructured Data enhances understanding of the patient journey by surfacing real-world clinical context (such as physician rationale, symptom progression, lab-driven decisions, and care pathway triggers) that shapes treatment choices.
Unstructured Data strengthens strategy by converting complex clinical narratives into actionable provider-level insights that support market access planning, competitive differentiation, and more precise patient finding.
What Makes Us Different?
Data
Through NorstellaLinQ, we have unparalleled access to rich, raw electronic medical records data, spanning 50 million U.S. lives, 2.5 billion clinical notes, more than 40 health systems and 600 hospitals, with NPI data linked back to each facility.
Uncover Site of Care Challenges
NorstellaLinQ combines cutting-edge artificial intelligence, natural language processing and large language model technologies to transform billions of clinical notes into structured, analyzable datasets.
Understand Prescribing Behavior
Our clinical, technology, data and pharmacy experts form a cross-functional team to review and validate every AI-generated output to ensure quality, accuracy, and trust.
Transform Patient Care With Integrated Lab and EMR Data
Revolutionize patient care by combining structured lab data with unstructured EMR notes.
Lab Data
- Identifies patients with specific biomarkers
- Enables precision medicine and targeted therapy
Enhanced Insights With EMR Notes
- Tracks tumor sizing trends over time
- Reveals when to switch therapies
- Provides actionable insights for timely interventions
This integration empowers HCPs to make proactive, informed decisions, positioning your product where it matters most and improving patient outcomes.
Key Insights From MMIT
Finding Hidden Patient Populations With Unstructured Data
This article was originally published on Norstella’s website.
As real-world data (RWD) evolves, the rise of unstructured data—from clinical notes to lab reports—is transforming how life sciences teams identify, understand, and engage with patients and providers. These data sources offer rich, contextual insights into patient experiences, disease progression, and physician decision-making that traditional structured datasets often miss.
At a recent industry event, I sat down with Joanne Tsai, director and team lead for Oncology at Pfizer, Lance Wolkenbrod, senior principal, RWD Solutions at Norstella, and Madeline Naylor, chief clinician, RWD at Norstella. We spoke about how unstructured data is helping to uncover hidden patient populations, enhance physician targeting, and bridge the gap between commercial strategy and clinical reality.
Q: How has the use of unstructured data evolved in the past few years, and why is it so important for understanding patient populations?
Lance Wolkenbrod: For a long time, we relied solely on structured data—ICD-10 codes, CPT codes, and drug treatment codes—to understand patients. That information was useful for tracking reimbursement, but it didn’t fully reflect what was happening in the clinic.
As electronic medical records became more robust, we gained access to unstructured clinical notes—what physicians actually write about their patients. These notes reveal the physician’s thought process, patient symptoms, and treatment rationale. Mining that kind of data used to be incredibly difficult, but with advances in AI and natural language processing (NLP), we can now extract meaningful insights that simply weren’t accessible before.
Q: Joanne, how is Pfizer using unstructured data today to uncover patient populations and support commercial efforts?
Joanne Tsai: In my role, I sit within the commercial group, working closely with brand leads and sales leadership. We don’t access the raw unstructured notes directly, but we rely on Norstella’s analytics teams to derive key insights from them, such as biomarker status. For example, identifying whether a patient is BRAF-positive or ALK-positive helps us refine our market share estimates and performance tracking, since this information doesn’t exist in claims data.
We also use unstructured data to generate lab alerts for our field teams. These alerts help identify physicians who are treating patients who’ve tested positive for specific biomarkers—so our reps can engage at the right moment.
For rare or low-incidence conditions, this kind of signal is incredibly valuable. In some of our oncology brands, only 3–5% of patients meet certain criteria, so finding those patients through traditional data alone is nearly impossible. Field teams see these alerts as actionable leads that improve engagement with physicians and ultimately help patients access the right therapies sooner.
Q: From a clinical perspective, what makes unstructured data so powerful for identifying hidden patients?
Madeline Naylor: Unstructured data provides the clinical richness that structured data lacks. When a physician dictates a clinical note, they include details you’d never see in claims, such as symptoms, imaging findings, lab values, even over-the-counter medications. In one paragraph of a note, you might capture the entire clinical picture: what the patient presented with, how the doctor assessed them, what treatments were initiated, and when follow-ups are scheduled.
For conditions like oncology or neurology, that context is invaluable. If we can see how and when patients are presenting—before a formal diagnosis—we can identify patterns earlier and help teams reach those patients sooner. That’s the power of unstructured data: it fills in the gaps of the patient journey, helping us move from just understanding what is happening to understanding why it’s happening.
Q: What are some of the main challenges of working with unstructured data?
Madeline Naylor: The biggest challenge is scale and transformation. We have more than 600 different note types within our EMR data, everything from prior authorization notes to genetic counseling, nurse consults, imaging, and biopsy reports. That’s an incredible asset, but it only becomes useful when we can structure and interpret it effectively.
We’ve made great strides with AI and NLP to translate that unstructured data into actionable insights, but it’s still an evolving process. It requires constant iteration, close collaboration with clients, and continuous learning about what works best for different use cases.
Joanne Tsai: From a commercial perspective, data literacy is another big challenge. Many teams are comfortable with claims data but unfamiliar with EMR data, especially unstructured sources. It takes time and investment to educate teams on what this data can do and how to interpret it responsibly. It also takes time to build the right NLP models—our team spent six months developing a reliable market share model based on unstructured data. The results were worth it, but it’s important to set expectations around the learning curve.
Q: What new opportunities does unstructured data open up beyond patient identification?
Madeline Naylor: We’re now using unstructured data not just to find patients, but to understand the why behind physician behaviors. Why are doctors prescribing one drug over another? Why aren’t certain physicians referring patients to clinical trials? When we analyze notes, we uncover reasons—like perceived side effects or access barriers—that weren’t visible before.
This insight also allows us to develop physician profiles and referral networks. We can map who’s referring patients into trials, who’s prescribing early, and how care teams are interconnected. That understanding supports both clinical trial recruitment and commercial targeting strategies.
Joanne Tsai: Exactly. On our side, we’re using this kind of insight to refine dynamic targeting. For example, by sequencing patient encounters, we can identify not just the prescribing physician, but everyone involved in that patient’s care—NPs, PAs, or other specialists who influence treatment decisions. That helps our field teams reach the full network of decision-makers, not just the primary prescriber.
Q: Looking ahead, how do you see unstructured data shaping the future of RWD and commercial strategy?
Lance Wolkenbrod: We’re at a turning point where unstructured data is helping us move from retrospective analysis to real-time insights. As models become faster and more precise, we can use these signals to identify emerging patient populations, optimize engagement strategies, and support better access to therapies. The goal isn’t just more data—it’s more meaningful data that drives better outcomes for patients and smarter decisions for the industry.
Learn more about how MMIT and Norstella transform physician clinical notes, specialty consultations and other unstructured EMR notes into longitudinal, analytic-ready data.
Mining the Hidden Gems in Unstructured EMR Data
This article was originally published in BioPharma Dive.
Pharma companies are increasingly turning to real-world data to answer their commercial business questions, but not all realize that unstructured EMR data is the unsung hero of most queries. Whether a manufacturer is struggling to find a niche patient population, conduct unbiased outcomes research, or generate persuasive proof points, unstructured data can fill in the gaps left by other real-world data sources.
Until recently, this valuable information has been virtually impossible to analyze at scale. Much of the patient data contained in EMR systems—like a patient’s demographic information, vitals and procedural history—adheres to a defined format, which makes analysis feasible. But the qualitative information recorded by a patient’s care team, such as clinical notes, radiology reports and discharge summaries, is stored in free-text fields.
For years, the complexity of turning this data into insights meant manufacturers were unable to see the complete patient journey. But why is this data so pivotal in the first place?
Consider Sarah, a grandmother recently diagnosed with stage 3 breast cancer. While medical claims and lab results reveal glimpses of Sarah’s treatment journey, the richest details about her care—her tumor size, biomarker levels, diagnostic notes, symptoms and physician sentiment—are buried in her electronic medical records.
For the pharma company whose therapy is designed to treat Sarah’s tumor, this data is essential for finding Sarah and others like her: a highly specific subset of post-lumpectomy, stage 3 patients with both ER positive and HER2 negative tumors less than 3 cm in size. Without the ability to parse this unstructured EMR data, the manufacturer will never be able to find Sarah in time to impact her treatment—or improve her outcomes.
Precise Patient Identification and HCP Targeting
With the evolution of AI and natural language processing (NLP), manufacturers can now comb through unstructured clinical data for any combination of terms, clinical scores or test results. A patient’s unstructured EMR data might include a physician’s observations about their family history, potential diagnoses, or nuances of their clinical progression. These tidbits are the missing puzzle pieces that help commercial and HEOR teams understand the full picture of a patient’s care.
The oncology manufacturer in our scenario, for example, can now segment its starting cohort of patients based on where they are in their treatment journey. To find eligible patients, this manufacturer needs to know which patients have undergone genetic variant testing, and what those results indicate.
Using integrated lab, claims, and EMR data, the manufacturer can pull in test results and deploy NLP on the unstructured EMR data to see variant results of interest. While traditional datasets may only indicate if a biomarker is positive or negative, EMR data can return nuances like high, moderate or low expression levels.
By analyzing these real-world datasets in tandem, the manufacturer can pinpoint patients with the right metastatic diagnostic codes and exclude those with the wrong codes. Unstructured EMR data further narrows the focus to the patient’s tumor biology, returning patients with both the right variant and mutation status to be eligible for the manufacturer’s therapy.
Instead of a million potential patients, the manufacturer now knows exactly which patients can benefit from their treatment—and who their prescribing providers are. The pharma company’s sales reps can contact Sarah’s primary treating physician within the window of opportunity, ensuring that she is able to benefit from their targeted therapy.
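The cohort narrowing described in this section can be sketched as a simple filter. This is a minimal illustration only: it assumes the NLP step has already turned each patient's notes into structured fields, and the `PatientRecord` class, its field names, and the sample patients are all hypothetical rather than part of any MMIT or Norstella product.

```python
from dataclasses import dataclass


@dataclass
class PatientRecord:
    """Hypothetical structured output of note extraction for one patient."""
    patient_id: str
    stage: int
    er_status: str          # "positive" or "negative"
    her2_status: str        # "positive" or "negative"
    tumor_size_cm: float
    post_lumpectomy: bool


def is_eligible(p: PatientRecord) -> bool:
    """Criteria from the Sarah example: post-lumpectomy, stage 3,
    ER positive, HER2 negative, tumor smaller than 3 cm."""
    return (
        p.post_lumpectomy
        and p.stage == 3
        and p.er_status == "positive"
        and p.her2_status == "negative"
        and p.tumor_size_cm < 3.0
    )


# Invented sample data for illustration.
patients = [
    PatientRecord("P001", 3, "positive", "negative", 2.4, True),
    PatientRecord("P002", 3, "positive", "positive", 2.1, True),   # HER2 positive
    PatientRecord("P003", 2, "positive", "negative", 1.8, True),   # stage 2
    PatientRecord("P004", 3, "positive", "negative", 3.5, True),   # tumor too large
]

eligible_ids = [p.patient_id for p in patients if is_eligible(p)]
print(eligible_ids)
```

In practice the hard part is populating those fields reliably from free text. The filter itself is the easy last step.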
Customized Analytics for Addressable Patient Populations
Leveraging unstructured EMR data in combination with other real-world datasets can help pharma companies create a bespoke data asset for a myriad of use cases. After developing a clinical algorithm to surface the right patient information from as many interconnected datasets as necessary, a manufacturer can use this asset on an ongoing basis, moving both forward and backward in the data.
For example, this data can help identify the right patients at exactly the right time: after a positive biopsy, variant testing and tumor grading are completed and before therapy begins. A manufacturer can track patients as they approach the treatment decision, using weekly trigger files to alert their sales reps to reach out to the prescribing physician before that decision has been made.
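One way to picture a weekly trigger file is as a filter over each patient's dated events. The sketch below is an assumption-heavy illustration: the event names, the seven-day lookback, and the `in_decision_window` helper are hypothetical, and a real trigger would rest on a validated clinical algorithm.

```python
from datetime import date, timedelta

# Hypothetical event names; all three must be complete before a patient qualifies.
REQUIRED_EVENTS = {"positive_biopsy", "variant_testing", "tumor_grading"}


def in_decision_window(events: dict, as_of: date, lookback_days: int = 7) -> bool:
    """True if every required event is recorded, therapy has not started,
    and the last required event fell within the lookback window."""
    if "therapy_start" in events and events["therapy_start"] <= as_of:
        return False
    if not REQUIRED_EVENTS.issubset(events):
        return False
    last_done = max(events[name] for name in REQUIRED_EVENTS)
    return as_of - timedelta(days=lookback_days) < last_done <= as_of


as_of = date(2024, 3, 11)
# Invented sample data for illustration.
patient_events = {
    "P001": {"positive_biopsy": date(2024, 2, 20),
             "variant_testing": date(2024, 3, 1),
             "tumor_grading": date(2024, 3, 8)},
    "P002": {"positive_biopsy": date(2024, 2, 15),
             "variant_testing": date(2024, 2, 25),
             "tumor_grading": date(2024, 3, 6),
             "therapy_start": date(2024, 3, 9)},      # decision already made
    "P003": {"positive_biopsy": date(2024, 3, 5),
             "variant_testing": date(2024, 3, 9)},    # grading still pending
    "P004": {"positive_biopsy": date(2024, 1, 10),
             "variant_testing": date(2024, 1, 20),
             "tumor_grading": date(2024, 1, 25)},     # outside this week's window
}

trigger = sorted(pid for pid, ev in patient_events.items()
                 if in_decision_window(ev, as_of))
print(trigger)
```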
Pharma companies can also capitalize on the longitudinal nature of this data to run a comprehensive HEOR study that reviews historical patient outcomes. They could create a compelling value story for a new brand by leveraging suboptimal outcomes data for patients treated with competitor products. The data could also reveal trends in physician treatments, referral patterns or unmet needs that might inform future development priorities.
By bridging unstructured EMR data with open and closed medical claims, reference and hospital lab tests and results, and structured EMR data, pharma companies can create a tailor-made data asset that will be instrumental for a wide variety of use cases. Studying the nuances of this longitudinal data can help a manufacturer’s HEOR and commercial teams efficiently map the patient journey, identify barriers and ease access to their therapies.
Learn how MMIT’s longitudinal, analytic-ready unstructured data can help your team identify patients and track disease progression.
How Unstructured EMR Data Helps Pharma Find Patients
As therapies have become more complex, pharma companies are now challenged to achieve precision targeting within a much tighter timeframe. While claims data is readily available, one of its key limitations is the lack of timeliness.
Many manufacturers now rely on specialized lab data—from imaging results to genetic testing and genomics—to identify eligible patients and their providers. As lab data is often the key driver in diagnostic decisions, this is an excellent source for commercial targeting initiatives. But what about understanding the intent behind the testing?
Unlike lab data, EMR data provides insights into physician sentiment as well as the patient’s care journey. In fact, unstructured EMR data contains the richest recorded details about a patient’s care, from biomarker levels and tumor specifics to the reasoning behind a treatment plan.
To learn more about this underappreciated data source, I spoke to Ilan Behm, vice president of Real-World Data Engagement at Norstella.
Q: What exactly is unstructured EMR data?
A: When a patient has an office visit, their doctor uses drop-down menus to add specific values to their chart, things like vital signs and diagnostic codes. We call that structured EMR data. The unstructured data is basically all the rest of the information that contextualizes the visit: why the patient has come in, what was done during past appointments, and what the plan is for the future.
Unstructured data is captured when a doctor records and transcribes their clinical notes, or when they write free text directly into the patient’s chart. This type of data is prone to typos and redundancies, and the wording varies quite a bit from physician to physician. However, these fields are often the only place you can find the richest information about a patient, like their biomarker levels or tumor size.
Q: How can that rich unstructured data be made searchable and usable?
A: Depending on the case, we might first deploy large language models and natural language processing to search these clinical notes for specific keywords of interest. Standardized data science techniques help us confirm that these keywords mean what we think they mean, and that they’re not leading us to a false positive. For example, there’s a huge difference between a past diagnosis of breast cancer and a family history of breast cancer.
After extracting this information on a note and patient level, we gather data on when the keyword was used and in what context: is there a date? Which encounter was this note from? At which health system or office did this visit occur? This allows us to relate the note to other EMR data points, like which doctors were involved in that visit, what medications were prescribed during the encounter, etc.
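To make the keyword-plus-context idea concrete, here is a deliberately simplified sketch. It uses plain regular expressions instead of a language model, and the `Note` fields, cue words, and sample notes are all hypothetical. It only shows why a keyword hit needs surrounding context before it can be trusted, and how each hit keeps its note, encounter, and date.

```python
import re
from dataclasses import dataclass
from datetime import date


@dataclass
class Note:
    """Hypothetical note record with the metadata needed to link it back."""
    note_id: str
    patient_id: str
    encounter_id: str
    note_date: date
    text: str


KEYWORD = re.compile(r"breast cancer", re.IGNORECASE)
# Assumed cue words suggesting the mention is about a relative, not the patient.
FAMILY_CUES = re.compile(
    r"\b(family history|mother|sister|aunt|grandmother)\b", re.IGNORECASE
)


def classify_mentions(note: Note) -> list:
    """Return one record per sentence that mentions the keyword,
    labeled by whether the sentence looks like family history."""
    results = []
    for sentence in re.split(r"(?<=[.;])\s+", note.text):
        if KEYWORD.search(sentence):
            context = "family_history" if FAMILY_CUES.search(sentence) else "patient_history"
            results.append({
                "note_id": note.note_id,
                "patient_id": note.patient_id,
                "encounter_id": note.encounter_id,
                "note_date": note.note_date,
                "context": context,
            })
    return results


# Invented sample notes for illustration.
notes = [
    Note("N1", "P001", "E10", date(2023, 4, 2),
         "Patient reports fatigue. Family history of breast cancer in mother."),
    Note("N2", "P002", "E22", date(2023, 5, 9),
         "Past diagnosis of breast cancer, treated with lumpectomy in 2019. Follow-up in 6 months."),
]

mentions = [m for n in notes for m in classify_mentions(n)]
for m in mentions:
    print(m["note_id"], m["context"])
```

A rule list like this misses negations, abbreviations and typos, which is part of why the interview describes combining language models with additional validation.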
Q: Why is unstructured EMR data particularly useful within oncology?
A: If we look at only structured data from the EMR, we can see what kind of tumor a patient has. We may even be able to see if the cancer has metastasized, and if the patient has a secondary cancer. But most of the targeted therapies available today have to be deployed at specific stages of the tumor. All of the pivotal information manufacturers need—details of the patient’s cancer staging, disease progression, and tumor biology—is recorded in the clinical notes. That’s what makes unstructured EMR data so essential.
On top of that, most of these targeted oncology therapies are also associated with specific biomarkers. If a therapy is only applicable to a tiny subset of the patient population, let’s say 10% of the stage 3 colorectal cancer patients who’ve tested positive for a certain biomarker, then the race is on to find those patients. Time is of the essence in cases like these: life science companies need to identify those eligible patients and their treating physicians as quickly as possible, because lives depend upon it.
Q: It seems like timeliness would be a big driver within rare disease as well. Can you speak to that space?
A: Yes, absolutely. Rare diseases are hard to diagnose, and we’re talking about very small patient populations. There may only be one or two treatments available, so it’s all the more important to get to those patients and their physicians as quickly as possible. Their providers may not even know what this condition is, nor how to treat it. Pharma companies must be fast and strategic about finding and educating treating physicians in time.
You know, more and more rare diseases are being identified now, but the diseases themselves aren’t new—they were just previously unnamed, which naturally means they didn’t have an ICD-10 code. To find an undiagnosed patient population in rare disease, you might need to search unstructured EMR data for patients who experienced a handful of different symptoms, which occurred in a certain pattern within a specific timeframe, in such a way that suggests they could have this newly identified condition.
Of course, you need longitudinal data to do that, to see years into a patient’s past history. In order to see the totality of events that patients are experiencing, you really need all of these real-world datasets in tandem: both structured and unstructured EMR data, open and closed claims data, lab tests and results. Integrating all of it together is really the only way to see that total patient journey.
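The symptom-pattern search described in this answer can be sketched as a sliding-window check over dated symptom mentions. Everything specific here is an assumption for illustration: the symptom names, the three-symptom threshold, and the 365-day window are invented. The sketch also only tests co-occurrence within a timeframe, not the order in which symptoms appear.

```python
from datetime import date, timedelta

# Hypothetical symptom set; a real definition would come from clinical experts.
TARGET_SYMPTOMS = {"muscle weakness", "double vision", "difficulty swallowing"}


def has_symptom_cluster(mentions, min_distinct=3, window_days=365):
    """mentions: list of (date, symptom) pairs extracted from notes.
    True if at least `min_distinct` different target symptoms occur
    within any single window of `window_days` days."""
    relevant = sorted((d, s) for d, s in mentions if s in TARGET_SYMPTOMS)
    window = timedelta(days=window_days)
    for i, (start, _) in enumerate(relevant):
        seen = {s for d, s in relevant[i:] if d - start <= window}
        if len(seen) >= min_distinct:
            return True
    return False


# Invented sample data for illustration.
patient_mentions = {
    "P001": [(date(2021, 1, 10), "muscle weakness"),
             (date(2021, 6, 2), "double vision"),
             (date(2021, 11, 20), "difficulty swallowing")],
    "P002": [(date(2018, 3, 1), "muscle weakness"),        # symptoms spread over years
             (date(2020, 5, 1), "double vision"),
             (date(2022, 9, 15), "difficulty swallowing")],
    "P003": [(date(2022, 2, 1), "muscle weakness"),        # only two target symptoms
             (date(2022, 3, 1), "headache"),
             (date(2022, 4, 1), "double vision")],
}

candidates = sorted(pid for pid, m in patient_mentions.items()
                    if has_symptom_cluster(m))
print(candidates)
```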
Q: So how can this unstructured EMR data be bridged to other datasets?
A: Typically, if a manufacturer already has claims data from Vendor A, and they want to use a supplemental dataset, they’ll pay Vendor B for a data pull or a subscription to acquire lab data, or EMR data. But then, they would also have to pay a third-party company to bridge those two datasets and link their respective patient IDs. Not all of the patients would match, so they’d lose some of the patient files in the process. And they’d also have to do expert determination, which might require a fourth vendor to ensure patients remain unidentifiable after the data is linked.
Basically, there’s always a loss of data fidelity associated with this process. It’s also expensive and time-consuming. If a manufacturer wanted to use this data on a weekly basis, they might choose to pay just once for data tokenization and harmonization, but they would still have to keep bridging files every week, running quality control, and so on and so forth. It’s not a very sustainable process.
That’s why NorstellaLinQ is a real game-changer, because we’ve already integrated our real-world datasets. We use the same Norstella patient ID across our data, whether that data originated from open claims, closed claims, lab tests, vital tables, wherever. It allows us to tie the information contained in unstructured clinical notes to the rest of the patient’s journey, so we can see the full picture of how their care and disease have progressed over time.
We eliminate the time and expense of harmonization, as well as the loss of data fidelity. Our clients don’t have to wait; they can just start running analytics right away.
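The practical benefit of a shared patient ID is that sources can be merged without a token-matching step. The sketch below illustrates that idea with invented rows; the field names and the `build_timelines` helper are hypothetical and do not reflect the actual NorstellaLinQ schema.

```python
from collections import defaultdict
from datetime import date

# Invented sample rows; each source already carries the same patient_id.
claims = [
    {"patient_id": "A1", "date": date(2023, 1, 5), "detail": "diagnosis claim"},
    {"patient_id": "B2", "date": date(2023, 2, 1), "detail": "office visit claim"},
]
labs = [
    {"patient_id": "A1", "date": date(2023, 1, 12), "detail": "biomarker test result"},
]
notes = [
    {"patient_id": "A1", "date": date(2023, 1, 20), "detail": "oncology consult note"},
]


def build_timelines(**sources):
    """Merge rows from every source into one date-ordered timeline per patient.
    No ID bridging is needed because all sources share patient_id."""
    timelines = defaultdict(list)
    for source_name, rows in sources.items():
        for row in rows:
            timelines[row["patient_id"]].append(
                (row["date"], source_name, row["detail"])
            )
    for events in timelines.values():
        events.sort()
    return dict(timelines)


timelines = build_timelines(claims=claims, labs=labs, notes=notes)
for event_date, source, detail in timelines["A1"]:
    print(event_date, source, detail)
```

When sources use different IDs, the same merge first needs a linkage step in which some patients fail to match. That is the fidelity loss described in the answer above.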
Q: Tell me more about the data science techniques used to validate this data.
A: When working with this data, you don’t just use one AI, machine learning, or large language model to find your results. You’re looking for consensus between multiple models, which helps to ensure that the end results are an accurate representation. For example, let me explain how AI and data science can be used in a predictive manner, to extrapolate our initial findings to a broader population.
In type I diabetes, one of the key measures our clients look for in unstructured EMR data is islet autoantibody testing, a blood test that can be used to diagnose type I diabetes, or to determine if the patient has type II diabetes. The lab data reveals that the autoantibody testing is occurring, but we don’t know the sentiment behind why the test was ordered until we look in the clinical notes in the EMR.
By studying all the characteristics of the patient, physician, and the sequencing of ordered tests, we can confirm which patterns indicate that a physician is ordering this test to confirm type I diabetes. We can then apply that pattern, via a machine learning model, to the broader patient population, for patients where we don’t have access to their physicians’ unstructured clinical notes.
In this way, we can predict with confidence that a particular subset of early autoantibody testing was performed because the physicians suspected type I diabetes. By knowing a physician’s intention, we can help life science companies understand who the experts are in a given field, and which HCPs and HCOs might benefit from additional education and awareness from field teams.
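The idea of looking for consensus between multiple models can be illustrated with a simple majority vote. This is a minimal sketch under stated assumptions: the labels, the three-model setup, and the agreement threshold are hypothetical, and the interview does not describe how the actual consensus step is implemented.

```python
from collections import Counter


def consensus_label(votes, min_agreement=2):
    """Return the label that at least `min_agreement` models agree on,
    or None when the models do not reach that level of agreement."""
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= min_agreement else None


# Invented outputs from three hypothetical models reading the same notes.
model_outputs = {
    "note-1": ["suspected_type_1", "suspected_type_1", "rule_out_type_2"],
    "note-2": ["suspected_type_1", "rule_out_type_2", "unclear"],
}

decisions = {note_id: consensus_label(votes)
             for note_id, votes in model_outputs.items()}
print(decisions)
```

Notes with no consensus (`None`) would be natural candidates for expert review instead of automatic labeling.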
Learn more about how our unstructured EMR data can help your team find eligible patients and their prescribers.
Solutions to Support Your Strategy
Enhance the value of unstructured data with complementary solutions that add deeper clinical context across patient journeys, provider behavior, and payer dynamics. Empower your team to act on richer, real-world insights with greater precision and confidence.