Building a Patient Data Pipeline
Extracting insights from a healthcare system's PDFs and APIs, with a detour from Go to JavaScript.
A client had a patient management system full of useful data, but getting that data out was a manual slog. PDFs everywhere, no easy way to aggregate information, and a director who needed regular summary reports that took hours of copying and pasting to produce. They also wanted to use the historical data for research, but privacy regulations meant they couldn’t just hand raw patient records to researchers.
I built a pipeline that pulled data from their system, parsed the PDFs, generated automated reports, and de-identified records for the research database.
The Starting Point
The patient management system had an API - not a great one, but functional. It could return patient records as JSON and provide download links for PDF reports. These PDFs were the main problem. They contained clinical information, treatment notes, and outcome data in a format designed for humans to read, not machines to parse.
The reports followed a template, but not strictly. Field labels were consistent, but their positions varied. Some sections were present on some reports and absent on others. Tables had variable numbers of rows. This is typical for PDF parsing - the format preserves visual layout, not semantic structure.
Why I Started with Go (and Why I Switched)
My initial choice was Go. I’d been using it more for backend work, and the plan was to build a CLI tool that could run as a scheduled job. Go compiles to a single binary, handles concurrency well, and the type system catches errors early.
The API integration went fine. Go’s net/http is solid, and I had the JSON parsing working quickly. The problem was PDFs.
Go’s PDF library situation is limited. There are a few options, but none handled the complexity of these reports well. The text extraction was lossy - tables came out as jumbled text, layout information was lost, and I was spending more time working around library limitations than solving the actual problem.
I evaluated the options: push through with a limited library, find or write better Go bindings for a C PDF library, or switch languages. Building PDF parsing from scratch wasn’t realistic - PDF is a surprisingly complex format with multiple text encoding schemes, font embedding, and layout constructs.
I switched to JavaScript. The ecosystem is massive, and libraries like pdf-parse and pdf2json have been battle-tested on millions of documents. Within a day of the switch, I had text extraction working properly. Within two days, I had the table parsing figured out.
Was Go the “better” language? In some ways, yes. But I finished the project in JavaScript instead of fighting with PDF libraries in Go. The pragmatic choice was the right choice. I’ve written about this before - sometimes you pick the tool with the libraries you need, not the language you prefer.
Parsing the PDFs
PDF parsing is more art than science. The library gives you text content with position information, but you have to reconstruct the structure yourself. A table in a PDF isn’t marked as a table - it’s just text positioned in a grid pattern.
My approach, with a sketch of the core steps after the list:
- Extract all text elements with their x/y coordinates
- Group elements into lines based on y-position (with some tolerance for slight variations)
- Identify section headers by font size or weight
- For tables, detect column boundaries by looking for consistent x-positions across multiple rows
- Map the extracted values to a structured schema based on section headers and field labels
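A minimal sketch of the line-grouping and column-detection steps, assuming the extraction library has already produced a flat list of `{ text, x, y }` items for a single page (the exact item shape varies by library, so treat the field names as placeholders):

```javascript
// Group extracted text items into visual lines by y-position.
// `items` is assumed to be [{ text, x, y }, ...] for one page;
// `tolerance` absorbs small vertical jitter between items on the same line.
function groupIntoLines(items, tolerance = 0.5) {
  const sorted = [...items].sort((a, b) => a.y - b.y || a.x - b.x);
  const lines = [];
  for (const item of sorted) {
    const current = lines[lines.length - 1];
    if (current && Math.abs(item.y - current.y) <= tolerance) {
      current.items.push(item);
    } else {
      lines.push({ y: item.y, items: [item] });
    }
  }
  // Within each line, order items left-to-right.
  for (const line of lines) {
    line.items.sort((a, b) => a.x - b.x);
  }
  return lines;
}

// Detect candidate column boundaries: x-positions that recur across many rows.
function detectColumns(lines, minRows = 3, xTolerance = 1) {
  const buckets = new Map(); // rounded x -> occurrence count
  for (const line of lines) {
    for (const item of line.items) {
      const key = Math.round(item.x / xTolerance) * xTolerance;
      buckets.set(key, (buckets.get(key) || 0) + 1);
    }
  }
  return [...buckets.entries()]
    .filter(([, count]) => count >= minRows)
    .map(([x]) => x)
    .sort((a, b) => a - b);
}
```

The tolerance values are the knobs that need tuning per template: too tight and a slightly skewed line splits in two, too loose and adjacent rows merge.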
The messy part was handling inconsistencies. Reports from different periods had slightly different formats. Some used “Patient Name:”, others used “Name:”. Date formats varied. I ended up with a set of heuristics and fallbacks - try the expected format first, then try alternatives.
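The fallback pattern was simple enough to sketch; the label variants and date formats below are illustrative, not the full production set:

```javascript
// Try a list of known label variants until one matches.
const NAME_LABELS = ['Patient Name:', 'Name:']; // illustrative variants

function extractField(lineText, labels) {
  for (const label of labels) {
    if (lineText.startsWith(label)) {
      return lineText.slice(label.length).trim();
    }
  }
  return null;
}

// Normalise the date formats seen across report versions (illustrative set).
function parseReportDate(raw) {
  const formats = [
    { re: /^(\d{2})\/(\d{2})\/(\d{4})$/, order: ['d', 'm', 'y'] }, // 31/01/2024
    { re: /^(\d{4})-(\d{2})-(\d{2})$/, order: ['y', 'm', 'd'] },   // 2024-01-31
  ];
  for (const { re, order } of formats) {
    const match = raw.trim().match(re);
    if (match) {
      const parts = Object.fromEntries(order.map((k, i) => [k, Number(match[i + 1])]));
      return new Date(parts.y, parts.m - 1, parts.d);
    }
  }
  return null; // unknown format -> flag for manual review
}
```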
I also had to handle multi-page reports where tables spanned page breaks. The y-coordinates reset on each page, so I needed to track which table I was in and continue appending rows.
The result was a structured JSON object for each report: patient demographics, clinical dates, treatment information, outcome codes. This became the input for everything downstream.
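For illustration, the structured output per report looked something like this (field names and values are made up, not the exact production schema):

```javascript
// Illustrative shape of one extracted report (fictitious values).
const exampleRecord = {
  sourcePdf: 'report-2024-081.pdf',
  patient: {
    id: 'PMS-10492',
    name: 'Jane Citizen',
    dateOfBirth: '1947-03-12',
    postcode: '4000',
  },
  clinical: {
    referralDate: '2024-02-01',
    treatmentType: 'physiotherapy',
    sessions: [
      { date: '2024-02-14', durationMinutes: 45, outcomeCode: 'A2' },
    ],
  },
  parsedAt: '2024-03-01T06:00:00Z',
};
```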
Generating Reports for the Director
The director’s requirement was straightforward: a summary of recent activity, aggregated statistics, and any outliers that needed attention. Previously this involved opening dozens of PDFs and manually tallying numbers.
I built a report generator (sketched after the list) that:
- Queries the latest extracted data
- Computes aggregations: counts by treatment type, average durations, outcome distributions
- Identifies outliers: patients with unusually long wait times, incomplete records, pending follow-ups
- Formats everything into a clean HTML email with embedded tables and charts
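A sketch of the aggregation and outlier steps, assuming records shaped like the illustrative example above (the wait-time threshold is a placeholder):

```javascript
// Aggregate extracted records into the figures the summary email reports.
function summarise(records, { maxWaitDays = 30 } = {}) {
  const countsByTreatment = {};
  const outliers = [];

  for (const record of records) {
    const type = record.clinical.treatmentType;
    countsByTreatment[type] = (countsByTreatment[type] || 0) + 1;

    // Wait time = days between referral and first session.
    const firstSession = record.clinical.sessions[0];
    if (!firstSession) {
      outliers.push({ id: record.patient.id, reason: 'no sessions recorded' });
      continue;
    }
    const waitDays =
      (new Date(firstSession.date) - new Date(record.clinical.referralDate)) /
      (1000 * 60 * 60 * 24);
    if (waitDays > maxWaitDays) {
      outliers.push({ id: record.patient.id, reason: `wait time ${Math.round(waitDays)} days` });
    }
  }

  return { total: records.length, countsByTreatment, outliers };
}
```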
The email goes out automatically on a schedule. No one has to remember to run it, and the director gets consistent formatting every time. If something needs investigation, they can click through to the original record.
One small touch that mattered: I included a “data freshness” timestamp showing when the source data was last synced. If the pipeline fails silently, the stale timestamp makes it obvious something’s wrong.
De-identification for Research
The organization wanted to use historical data for research studies. Australian privacy regulations (and ethical guidelines) are strict about using patient information for secondary purposes. The solution was de-identification: remove or transform anything that could identify an individual, while keeping the medically relevant information intact.
De-identification isn’t just “delete the name field.” It’s a spectrum. Direct identifiers (name, address, Medicare number) obviously need to go. But combinations of indirect identifiers can also enable re-identification. If someone is the only 87-year-old male in a particular postcode who had a specific rare procedure on a specific date, that combination might be unique enough to identify them.
For this project, I implemented the following, sketched in code after the list:
- Removal: Names, addresses, contact details, Medicare numbers deleted entirely
- Generalization: Exact dates converted to year-quarter, ages grouped into ranges, postcodes truncated to state-level
- Pseudonymization: A consistent hash of the original patient ID, so researchers could track the same patient across records without knowing who they are
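A condensed sketch of those three operations, again assuming the illustrative record shape from earlier; the secret handling, age banding, and quarter bucketing are simplified stand-ins for the real rules:

```javascript
const crypto = require('crypto');

// Pseudonymize: a keyed hash so the same patient always maps to the same token,
// but the token can't be rebuilt without the secret.
function pseudonymize(patientId, secret) {
  return crypto.createHmac('sha256', secret).update(patientId).digest('hex').slice(0, 16);
}

// Generalize a date to year-quarter, e.g. '2024-02-14' -> '2024-Q1'.
function toYearQuarter(isoDate) {
  const d = new Date(isoDate);
  return `${d.getFullYear()}-Q${Math.floor(d.getMonth() / 3) + 1}`;
}

// Approximate age from date of birth (year difference only; enough for banding).
function approxAge(dateOfBirth, asOf = new Date()) {
  return asOf.getFullYear() - new Date(dateOfBirth).getFullYear();
}

// Group exact ages into 10-year bands, e.g. 87 -> '80-89'.
function toAgeBand(age) {
  const lower = Math.floor(age / 10) * 10;
  return `${lower}-${lower + 9}`;
}

function deidentify(record, secret) {
  return {
    pseudoId: pseudonymize(record.patient.id, secret),
    // Direct identifiers (name, address, contact details, Medicare number) are dropped entirely.
    // Postcode is generalized to a state/territory code via a lookup table (not shown).
    ageBand: toAgeBand(approxAge(record.patient.dateOfBirth)),
    referralQuarter: toYearQuarter(record.clinical.referralDate),
    treatmentType: record.clinical.treatmentType,
    outcomeCodes: record.clinical.sessions.map((s) => s.outcomeCode),
  };
}
```

Using an HMAC rather than a plain hash matters here: without the secret, nobody can rebuild the mapping by hashing known patient IDs.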
The generalization rules were based on guidance from the OAIC (Office of the Australian Information Commissioner). The goal is k-anonymity - any given combination of quasi-identifiers should match at least k individuals in the dataset.
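A simple way to sanity-check that property is to group de-identified records by their quasi-identifier combination and flag any group smaller than k (a rough check, not a full k-anonymity algorithm):

```javascript
// Flag quasi-identifier combinations that match fewer than k records.
function findSmallGroups(records, quasiIdentifiers, k = 5) {
  const groups = new Map();
  for (const record of records) {
    const key = quasiIdentifiers.map((field) => record[field]).join('|');
    groups.set(key, (groups.get(key) || 0) + 1);
  }
  return [...groups.entries()]
    .filter(([, count]) => count < k)
    .map(([key, count]) => ({ key, count }));
}

// e.g. findSmallGroups(researchRecords, ['ageBand', 'referralQuarter', 'treatmentType'], 5)
```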
The de-identified records went into a separate database. Access required ethics approval and was logged for audit purposes. Researchers could query aggregates and anonymized individual records, but never see the raw data.
Automating the Whole Thing
The pipeline runs as a scheduled job:
- Sync new records from the patient management API
- Download and parse any new PDFs
- Store structured data in the main database
- Run de-identification and populate the research database
- Check if it’s report day; if so, generate and email the director’s summary
I used simple file-based state tracking to avoid reprocessing the same PDFs. Each run checks what’s new since the last successful sync.
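The state file was nothing more elaborate than this kind of thing (path and shape are illustrative):

```javascript
const fs = require('fs');

const STATE_FILE = './pipeline-state.json'; // illustrative path

function loadState() {
  try {
    return JSON.parse(fs.readFileSync(STATE_FILE, 'utf8'));
  } catch {
    // First run, or the file is missing/corrupt: start from scratch.
    return { lastSuccessfulSync: null, processedPdfs: [] };
  }
}

function saveState(state) {
  fs.writeFileSync(STATE_FILE, JSON.stringify(state, null, 2));
}

// Only process PDFs we haven't seen in a previous successful run.
function newPdfs(allPdfIds, state) {
  const seen = new Set(state.processedPdfs);
  return allPdfIds.filter((id) => !seen.has(id));
}
```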
Error handling was important here. Healthcare data can’t just disappear into a failed job. If PDF parsing fails on a specific document, the system logs it, marks that record as requiring manual review, and continues with the rest. The director’s report includes a count of any records that failed processing.
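In code, that policy is just a try/catch around each document instead of around the whole batch; the helpers here (`parsePdf`, `markForReview`, `log`) are hypothetical stand-ins for the real ones:

```javascript
// Parse each PDF independently so one bad document doesn't stop the run.
async function processBatch(pdfIds, { parsePdf, markForReview, log }) {
  const failures = [];
  for (const id of pdfIds) {
    try {
      await parsePdf(id);
    } catch (err) {
      log(`Failed to parse ${id}: ${err.message}`);
      await markForReview(id);
      failures.push(id);
    }
  }
  return failures; // the count feeds into the director's report
}
```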
What Made This Work
The biggest win was removing manual work. Tasks that took hours now happen automatically. The director gets better information because the reports are consistent and timely. The research database grows continuously instead of requiring periodic manual exports.
The language switch was frustrating but correct. I lost a day rewriting what I’d built in Go, but gained it back quickly by having libraries that actually worked. It’s a good reminder that ecosystem matters as much as language features.
Healthcare data work requires thinking carefully about privacy at every step. It’s not something you bolt on at the end - the de-identification logic influenced how I structured the data from the start, which fields I kept, and how the databases were segmented.