🩺 Symptom-to-SNOMED on Construe

How we built a symptom-to-SNOMED demo on Construe, and the chunking, validation, and codes-per-chunk decisions that actually mattered

We made a short video walking through a tiny app we built: type a symptom in plain clinical shorthand, get back the right SNOMED codes. It's two minutes, and it's the fastest way to see what this guide is about before we get into the why. Here it is:


In the video, you type this into the box:

SOB, cough x3d, fever 38.5°C

A second later, three cards slide in:

  • SOB → Dyspnea (finding) · SNOMED 267036007
  • cough x3d → Cough (finding) · SNOMED 49727002
  • fever 38.5°C → Fever (finding) · SNOMED 386661006

That's the whole demo. We built the thin frontend in an afternoon with Claude, on top of a single call to our Construe API. The cards sliding in are the trivial part. The decisions that happened before they rendered are the part worth writing about, and they're the part you'll make differently when you build your own.

The demo is deliberately plain, to give you a sense of the capability. A clinician types the kind of thing they actually write (abbreviations, durations, a temperature with a degree sign), and validated SNOMED CT concepts encode that data behind the scenes, ready to be dropped into a FHIR Condition resource.
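To make "dropped into a Condition resource" concrete, here is a minimal sketch of wrapping the three cards from the video in FHIR R4 Condition resources. The helper name to_condition and the Patient/example reference are hypothetical, and this is not how the demo's own write-back is implemented; only the SNOMED codes and the standard SNOMED system URI come from the text and the FHIR specification.

```python
SNOMED_SYSTEM = "http://snomed.info/sct"  # standard FHIR URI for SNOMED CT


def to_condition(code: str, display: str, source_text: str, patient_ref: str) -> dict:
    """Wrap one extracted SNOMED concept in a minimal FHIR R4 Condition (illustrative)."""
    return {
        "resourceType": "Condition",
        "code": {
            "coding": [{"system": SNOMED_SYSTEM, "code": code, "display": display}],
            "text": source_text,  # the shorthand the clinician actually typed
        },
        "subject": {"reference": patient_ref},  # hypothetical patient reference
    }


# The three cards from the demo: (SNOMED code, display, source span).
cards = [
    ("267036007", "Dyspnea (finding)", "SOB"),
    ("49727002", "Cough (finding)", "cough x3d"),
    ("386661006", "Fever (finding)", "fever 38.5°C"),
]

conditions = [to_condition(c, d, t, "Patient/example") for c, d, t in cards]
```

A production resource would usually carry more fields (clinical status, onset, encounter); the sketch keeps only what the cards themselves provide.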

Getting a code back is the easy part. Getting the code a clinician would have picked, for each symptom, with nothing left over and nothing invented, is where the work goes. And most of that work isn't in the model call. It's in the configuration around it.

What is Construe

Construe rapidly extracts medical codes from clinical text. You hand it unstructured text and a terminology system (SNOMED CT, ICD-10, LOINC, RxNorm, or a custom code set you upload) and it returns the codes present in that text, with citations back to the span that produced each one.

SNOMED has hundreds of thousands of active concepts, and the set shifts with every release. Keeping extraction correct against a terminology that moves underneath you is a standing maintenance problem, not a one-time integration. Construe retrieves against the live terminology instead of asking an LLM to recall it, so the codes it returns actually exist in the version you asked for.

It also reads context rather than matching strings, which is the part the demo makes visible. Type:

results indicate CKD

and Construe returns Chronic kidney disease. It resolved the abbreviation rather than doing a literal "CKD" lookup. The more telling part is what comes back alongside it. From that same short phrase it surfaces the lab concepts a nephrology reader would expect to travel with CKD: ACR (albumin-to-creatinine ratio) and eGFR (estimated glomerular filtration rate). Neither acronym is spelled out in the input. Construe pulls them from the clinical context around the diagnosis, the way a clinician reading "results indicate CKD" already knows which results are meant.

We rely on this enough that Construe sits underneath our other products. Lang2FHIR uses it so that the FHIR resources it generates carry validated codes instead of invented ones. If you want the reasoning behind not letting an LLM emit codes directly, the story of how we built Construe walks through why we treat non-determinism as an engineering problem rather than waiting for the next model.

Two public endpoints matter for this guide:

  • POST /construe/extract extracts codes from text. This is the one the demo calls.
  • POST /construe/upload registers a custom code system, for the organization-specific vocabularies most healthcare companies turn out to have.

The demo only touches extract. Everything below is about how that one call is configured.

%% caption: The Symptom-to-SNOMED demo request path, conceptually. The text is preprocessed, sent to Construe /extract, and the cards return immediately while a FHIR write-back fires in the background.
flowchart LR
	U(["Clinician types<br>SOB, cough x3d, fever"]) --> PP["preprocess<br>commas to sentences"]
	PP --> CX["POST /construe/extract"]
	CX --> CARDS(["SNOMED cards<br>returned to user"])
	CX -. "fire-and-forget" .-> WF["write FHIR Condition<br>in the background"]

Why "extract the codes" is three problems

Look again at the input:

SOB, cough x3d, fever 38.5°C

A clinician reads that in under a second. The pipeline has three decisions to get right underneath it.

First, it's a list, not a sentence. Construe chunks text before it codes it. It breaks the input into spans and extracts per span. Hand it the raw comma-separated string and the sentence chunker sees one chunk: "SOB, cough x3d, fever 38.5°C." The per-chunk extraction then has to decide what the dominant concept of that blob is, and you get one code where you wanted three.

%% caption: The same input under two chunking outcomes. Left: the raw comma list is seen as one chunk and collapses to a single code. Right: preprocessing splits it into three sentences, each chunked independently into its own code.
flowchart TB
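	subgraph RAW["Raw, one chunk for the whole list"]
		A1(["SOB, cough x3d, fever 38.5C"]) --> A2["1 chunk"]
		A2 --> A3(["one merged code"])
	end
	subgraph PRE["Preprocessed, one chunk per symptom"]
		B1(["SOB. cough x3d. fever 38.5C."]) --> B2["3 chunks"]
		B2 --> C1["SOB"]
		B2 --> C2["cough x3d"]
		B2 --> C3["fever"]
		C1 --> D1(["Dyspnea"])
		C2 --> D2(["Cough"])
		C3 --> D3(["Fever"])
	end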

The fix is to split the input before the API ever sees it: break on commas and semicolons, and turn each item into its own sentence. Now the sentence chunker sees three chunks, and each chunk is one symptom. It's a small step, and it's the difference between three clean cards and one merged one. The preprocessing you do before the API is often what determines the quality of what comes back from it.
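A minimal sketch of that preprocessing step, assuming nothing about the demo's actual code. The function names are hypothetical, and naive_sentence_chunks is only a toy stand-in for a sentence chunker (it is not Construe's), included to show how the chunk count changes.

```python
import re


def commas_to_sentences(text: str) -> str:
    """Split on commas and semicolons and turn each item into its own sentence."""
    items = [item.strip().rstrip(".") for item in re.split(r"[,;]", text)]
    return " ".join(f"{item}." for item in items if item)


def naive_sentence_chunks(text: str) -> list[str]:
    """Toy sentence chunker: break after ., ! or ? when followed by whitespace."""
    return [chunk for chunk in re.split(r"(?<=[.!?])\s+", text.strip()) if chunk]


raw = "SOB, cough x3d, fever 38.5°C"
prepared = commas_to_sentences(raw)

raw_chunks = naive_sentence_chunks(raw)            # the whole list as one chunk
prepared_chunks = naive_sentence_chunks(prepared)  # one chunk per symptom
```

The decimal point in 38.5°C survives because the toy chunker only breaks on punctuation followed by whitespace. Running the same function on "chest pain, worse on exertion, radiating to left arm" produces three fragments, which is exactly the over-splitting a comma-based rule invites on input that isn't a clean list.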

This particular step assumes comma-delimited symptoms, which is the right assumption for triage-style shorthand and the wrong one for prose. A real narrative sentence like "patient reports shortness of breath with a three-day cough and a fever" passes through the split untouched and gets handled by the sentence chunker on its own, which is fine. Something comma-heavy that isn't a clean list, like "chest pain, worse on exertion, radiating to left arm," gets over-split into fragments, which is not. For a general note parser you'd reach for a smarter segmentation method; Construe exposes seven chunking methods across four families, from rule-based to clinical NER. The demo picks the one that fits its input.

Second, how hard to validate. Construe can run a validation pass on each candidate code, checking that the code fits the source span. It's the safe default, and it costs latency, because it's another pass per code. It's invaluable when you're testing inputs and want the highest accuracy. The demo turns it off with validation_method: none, as a deliberate trade. The per-code validation pass dominated the response time, and for a live demo where someone is watching cards appear, perceived speed is worth more than the extra check. The candidate codes are still real SNOMED concepts retrieved from the terminology; we're skipping the "does this code best fit this span?" judgment, not the "is this a real code?" guarantee. In a revenue-cycle pipeline where a wrong code has a dollar cost, you'd leave validation on and take the latency. The knob exists because the right answer for this demo is different from the right answer for a billing use case.

Third, how many codes per chunk. By default, Construe returns up to ten candidate codes per chunk, which is right when a chunk is a paragraph and you want recall. After preprocessing, each chunk is a single symptom, so ten codes per chunk would mean ten cards for "SOB," nine of them noise. The demo caps it at one with max_codes_per_chunk: 1. One symptom in, one code out. This works only because of the preprocessing decision above. The chunking choice and the codes-per-chunk choice are the same decision from two angles: each chunk was made atomic, so one code per chunk is the right ask. With coarser chunking, capping at one would discard real codes from a chunk that legitimately held several.

These three knobs read as API trivia, but together they're the design.
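Put together, the demo's configuration might be assembled into a request like the sketch below. The five config keys and their values are the ones this guide describes; everything else is an assumption for illustration, including the body layout (the text, system, and config field names), the "SNOMED" system identifier, the bearer-token header, and the placeholder base URL. Check the docs for the real request schema before relying on any of it.

```python
import json
import urllib.request

# Config keys and values as described in this guide.
DEMO_CONFIG = {
    "chunking_method": "sentences",
    "validation_method": "none",
    "max_codes_per_chunk": 1,
    "include_citations": True,
    "include_invalid": False,
}


def build_extract_request(base_url: str, token: str, text: str) -> urllib.request.Request:
    """Build (but do not send) a POST to /construe/extract.

    The body shape and auth header are assumptions, not the documented schema.
    """
    body = {
        "text": text,                   # assumed field name
        "system": {"name": "SNOMED"},   # assumed way of selecting the terminology
        "config": DEMO_CONFIG,          # assumed nesting of the config keys
    }
    return urllib.request.Request(
        url=f"{base_url}/construe/extract",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",  # assumed auth scheme
        },
        method="POST",
    )


# Placeholder host and token; nothing is sent over the network here.
req = build_extract_request("https://example.invalid", "TOKEN", "SOB. cough x3d. fever 38.5°C.")
```

Keeping the config in one named constant makes the trade-offs easy to flip per use case: a billing pipeline would swap in its own dictionary with validation left on and a higher cap.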

%% caption: The three configuration decisions and how they relate. They are not independent; the codes-per-chunk cap is sound only because chunking was made atomic first.
flowchart TB
	D1["chunking_method:<br>sentences"] -->|"makes each chunk<br>one symptom"| D3["max_codes_per_chunk: 1"]
	D2["validation_method:<br>none"] -->|"trades accuracy<br>for lower latency"| SPEED(["fast response"])
	D3 -->|"one code per<br>atomic chunk"| CLEAN(["clean 1:1 cards"])
	D1 -. "coarse chunking makes<br>the cap discard codes" .-> D3

The one architectural move worth copying

There's one structural decision in the demo worth pulling out: the fire-and-forget FHIR write-back. After the cards go back to the user, a background task hands the same results to a PhenoML Workflow that writes FHIR Condition resources into a sandbox. The user's request never waits on it.

That decoupling is the right instinct, and it's the one piece you should carry into anything real. The extraction call and the persistence call have different latency and failure profiles, and coupling them lets the slow, failure-prone one (writing FHIR through a workflow) hold the fast, user-facing one hostage. The user's job here is to see the codes; a persistence hiccup shouldn't stall the cards or surface an error over a feature they didn't ask for.

Where the demo cuts a corner you wouldn't: it swallows every write-back error silently. That's fine when the only goal is showing codes on screen, but "best-effort, log nothing" is also how silent data loss happens. In production you'd keep the fire-and-forget decoupling and replace the silent failure with a logged one and a retry, so the write-back is asynchronous without being invisible. The demo shows the shape; production adds the observability that shape needs.

Build your own, starting in the console

You don't need our app. The honest takeaway from a demo we built in an afternoon is that the app is the easy part: once Construe handles the retrieval, what's left is a handful of config decisions and a frontend. We wrote ours with Claude Code in a single sitting, and yours will look different because your use case, your input, and your accuracy-versus-latency trade-offs are different. That's the point. The config is the product decision, and it's yours to make.

The fastest way to get a feel for it is to test it out directly from the console. Sign up, grab credentials, and send it your own clinical shorthand.

The inputs that teach the most aren't the clean ones. A few worth trying in the console:

  • A real narrative sentence, to watch the sentence chunker do the work the comma-split can't.
  • A comma list with a multi-word concept inside it ("chest pain, radiating to left arm"), to watch naive preprocessing over-split.
  • Something context-dependent like "results indicate CKD," to see what Construe pulls from around the diagnosis.
  • The same input with validation_method flipped back on, to feel the latency that check costs.
  • max_codes_per_chunk raised to 3 on a paragraph, to see the recall a strict cap leaves on the table.

Each of those turns one of the decisions above from a line of config into something you can see. The docs cover the full config surface for Construe, including the upload endpoint for custom vocabularies. Build the version that makes the calls you need, and start in the console.

The demo's config at a glance

| Config key | Demo value | Default | Why the demo deviates |
| --- | --- | --- | --- |
| chunking_method | sentences | n/a | preserves byte offsets for citations; pairs with comma→sentence preprocessing |
| validation_method | none | per-code check | per-code validation dominated latency; demo trades it for perceived speed |
| max_codes_per_chunk | 1 | 10 | each preprocessed chunk is one symptom, so one code is the clean answer |
| include_citations | true | n/a | lets each card show the source span it came from |
| include_invalid | false | n/a | drop anything not marked valid before rendering |