
Annielytics.com

I make data sexy


Jun 18 2025

You Don’t Need Deep Pockets to Fine-Tune AI: My $50 POC

Fine-tuning an LLM to translate medical notes

A common point of pushback I get on AI projects is that fine-tuning models isn’t an option because it’s cost-prohibitive. That really hasn’t been the case for at least two years, but old fears die hard. So I set out to create a proof of concept (POC) I can point clients to in order to assuage their fears about model tuning.

When Prompt Engineering Just Doesn’t Cut It

Fine-tuning becomes necessary when prompt engineering alone can’t reliably produce consistent or domain-specific results. If your application depends on highly specialized language—like legal, medical, or technical jargon—or requires responses to follow strict formatting, tone, or workflow rules, prompt engineering can only take you so far.

Bloating inputs with repeated prompt instructions also adds cost. Prompts must be re-sent with every input, inflating token usage and slowing down response times. Fine-tuning bakes those instructions into the model itself, making inference faster, cheaper, and more predictable. Fine-tuning allows you to directly shape the model’s behavior using examples, rather than relying on prompt hacks to guide a general-purpose model. It’s especially useful when you need the model to perform well on a narrow task across thousands of inputs.
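As a back-of-the-envelope illustration of that token overhead (the token counts, request volume, and price below are made-up assumptions, not figures from this project):

```python
def prompt_overhead_cost(prompt_tokens: int, requests: int, price_per_1k: float) -> float:
    """Cost attributable solely to re-sending the same prompt instructions."""
    return prompt_tokens * requests / 1000 * price_per_1k

# Hypothetical numbers: a 1,500-token instruction block repeated across
# 100,000 requests, at $0.50 per 1K input tokens.
cost = prompt_overhead_cost(1500, 100_000, 0.50)
print(f"${cost:,.2f} spent just re-sending instructions")  # $75,000.00
```

With fine-tuning, those instructions live in the model's weights, so that line item largely disappears.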

Think of fine-tuning like hiring developers who already know Python but requiring them to take your internal course on code style, folder structure, and version control so they follow the structure and format established in your organization. You don’t need to make them read a book on coding basics to train them from the ground up. That would be overkill. You just need to (dare I say…) fine-tune their approach.

The Tectonic Shift of Model Tuning

What once required Silicon Valley-sized budgets and dedicated ML teams has become accessible to individual developers and small organizations. In the early days of generative AI development, fine-tuning AI models was the exclusive domain of tech giants with deep pockets. A single project could easily consume six-figure budgets between computational costs, specialized talent, and infrastructure requirements. Today’s reality, however, tells a different story. Modern fine-tuning techniques, like parameter-efficient training (e.g., LoRA)—coupled with democratized cloud computing and sophisticated open-source frameworks—have collapsed these barriers.

My Use Case: Use AI to Translate Medical Notes to Plain English

Medical documentation is filled with cryptic abbreviations like ‘SOB w/ h/o HTN and DM2’. This shorthand comes with a high learning curve. So, while a cardiologist immediately understands this as ‘shortness of breath with history of hypertension and type 2 diabetes’, patients and their families who swim downstream from these notes can feel like they’re deciphering medical hieroglyphics.

With the rise of patient portals, turning clinician shorthand into plain English can dramatically boost transparency and patient satisfaction with a level of openness unheard of 30 years ago.

Compiling the Training Data

The foundation of any successful fine-tuning project is a high-quality dataset. I initially tried creating a synthetic dataset stored as a JSONL file. A JSONL file is a text file where each line is a separate JSON object; these files are commonly used for storing large datasets that can be processed one record at a time. Below is a snippet of that file.

{"input": "Admitted for VSS, also noted NPO.", "output": "Admitted with vital signs stable, also noted to be nothing by mouth."}
{"input": "UF improving; continue OR therapy.", "output": "Ultrafiltration improving; continue oral therapy."}
{"input": "Progress: HOB stable, monitoring DTRs.", "output": "Progress: head of bed position stable, monitoring deep tendon reflexes."}
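A file like this can be streamed one record at a time with nothing but the standard library (a minimal sketch; the filename is a placeholder):

```python
import json

def read_jsonl(path):
    """Yield one training example per line; skips blank lines."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

# Each record is a dict with "input" and "output" keys, e.g.:
# for example in read_jsonl("medical_notes.jsonl"):
#     print(example["input"], "->", example["output"])
```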

I generated this dataset by lobbing the file back and forth between ChatGPT (using its o3 model) and Claude (using its Sonnet 4 model). Claude was much better at auditing and generating realistic data for this project; ChatGPT produced entries that were downright silly.

Model Selection and Training

For my demo I decided to use a free, open model and fine-tune it with custom data to see if I could train it to translate medical notes into plain English. I first benchmarked several models without tuning. Models above 7B parameters already did reasonably well with medical notes. However, since I specifically wanted to find a model that would benefit from fine-tuning, I went smaller to test the lower bounds of some of these open models.
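The post doesn't include the benchmarking harness itself, but a crude, illustrative way to score a model's translation against a reference is plain word overlap (my own stand-in metric, not the one used in this project):

```python
def word_overlap(candidate: str, reference: str) -> float:
    """Fraction of reference words that also appear in the candidate (0.0-1.0)."""
    cand = set(candidate.lower().split())
    ref = reference.lower().split()
    if not ref:
        return 0.0
    return sum(w in cand for w in ref) / len(ref)

score = word_overlap(
    "shortness of breath with history of hypertension",
    "shortness of breath with a history of hypertension",
)
```

A real evaluation would use something stronger (exact-match on expanded abbreviations, BLEU/ROUGE, or human review), but even a metric this simple makes "model A beats model B" comparisons repeatable.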

However, when I started testing smaller models with the synthetic data, I saw that they still struggled with more complex notes. It would’ve been fair to conclude that I needed a model with at least 7B parameters. But, out of sheer curiosity, I wondered if I could salvage smaller models by augmenting the training dataset. So I did some digging in Kaggle and across the web and eventually stumbled across a goldmine of abbreviation translations in this pdf published by Madison Memorial Hospital in Rexburg, Idaho.

I used the pdfplumber Python library to extract a list from the pdf and then augmented my JSONL file by asking Claude to expand that dataset. This coverage boost helped the model interpret more complex notes.
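A rough sketch of that kind of extraction is below. The page layout and parsing rules for the hospital's PDF are assumptions here (abbreviation, then two or more spaces, then the expansion), so the regex would need adjusting to the actual document:

```python
import re

# Assumed line shape: "SOB    Shortness of breath"
ABBREV_LINE = re.compile(r"^([A-Za-z0-9/&.\-]+)\s{2,}(.+)$")

def parse_abbrev_lines(lines):
    """Turn raw text lines into {abbreviation: expansion} pairs."""
    pairs = {}
    for line in lines:
        m = ABBREV_LINE.match(line.strip())
        if m:
            pairs[m.group(1)] = m.group(2).strip()
    return pairs

def extract_from_pdf(path):
    """Pull text from each page with pdfplumber, then parse it."""
    import pdfplumber  # third-party: pip install pdfplumber
    with pdfplumber.open(path) as pdf:
        lines = []
        for page in pdf.pages:
            text = page.extract_text() or ""
            lines.extend(text.splitlines())
    return parse_abbrev_lines(lines)
```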

With a beefier dataset, I tested two models under 1B parameters: DialoGPT-medium and DialoGPT-large. But even with the training data, they just weren’t hefty enough to reliably handle the notes—or even follow instructions. And in some instances they output gibberish.

I eventually landed on Microsoft’s Phi-3-mini-4k-instruct (3.8B parameters). It was small enough for Colab yet powerful enough to handle nuanced clinical text. Below is a list of my final contestants, though it’s not exhaustive: I spun my wheels for a few rounds before finalizing it and unfortunately deleted the cell where I’d tracked all the models I tested.

I used Colab to train my model, but you could also fine-tune using cloud tools like Amazon SageMaker, Azure ML, Paperspace, or Lambda Labs, to name a few. I also could have done this fine-tuning on Colab’s free tier if the dataset and model had been a little smaller, but the volume of training examples and the model size I needed to translate more complex notes required Colab Pro+ ($49.99/month) for faster GPUs. Hence the $50 bill.
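My notebook isn't reproduced here, but a typical Colab fine-tuning setup with Hugging Face's transformers, peft, and trl libraries looks roughly like this. The hyperparameters, file names, and prompt format are illustrative assumptions, not my exact values:

```python
def to_text(example):
    # Fold each {"input": ..., "output": ...} pair into one training string.
    return {"text": f"Note: {example['input']}\nPlain English: {example['output']}"}

def train(jsonl_path="medical_notes.jsonl", output_dir="phi3-medical-notes"):
    # Third-party: pip install transformers peft trl datasets
    from datasets import load_dataset
    from peft import LoraConfig
    from transformers import AutoModelForCausalLM
    from trl import SFTConfig, SFTTrainer

    model_id = "microsoft/Phi-3-mini-4k-instruct"
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
    dataset = load_dataset("json", data_files=jsonl_path, split="train").map(to_text)

    trainer = SFTTrainer(
        model=model,
        train_dataset=dataset,
        peft_config=LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                               task_type="CAUSAL_LM"),
        args=SFTConfig(output_dir=output_dir, num_train_epochs=3,
                       per_device_train_batch_size=4, dataset_text_field="text"),
    )
    trainer.train()
    trainer.save_model(output_dir)
```

Because LoRA only trains small adapter matrices, the saved artifact is a few megabytes rather than a full copy of the 3.8B-parameter model.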

Test Results

After testing various LoRA hyperparameters—such as rank, dropout, and bias—I found a configuration that produced stable translations with minimal memory usage. Adding k_proj and o_proj to the target modules further improved performance, allowing the model to translate medical abbreviations with reasonable accuracy. (These layers help the model better capture contextual differences and generate more precise outputs, especially in jargon-heavy domains like medicine. Truth be told, Claude offered the suggestion, and it just happened to be my final tweak before I saw results I was happy with.)
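In peft terms, adding those attention projections to the target modules looks like the config sketch below. The numeric values are placeholders, not my final settings:

```python
from peft import LoraConfig  # third-party: pip install peft

lora_config = LoraConfig(
    r=16,               # rank of the low-rank update matrices
    lora_alpha=32,      # scaling factor applied to the update
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    # q_proj/v_proj are the common starting point; k_proj and o_proj
    # were the additions that moved the needle for this task.
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
)
```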


Below are a couple translations generated by the tuned model.

📝 Note: ER visit for SOB and CP. CXR reveals bilateral infiltrates. CBC shows elevated WBC. Start ABX.
🤖 Response: Emergency room visit for shortness of breath and chest pain. Chest X-ray reveals bilateral infiltrates. Complete blood count shows elevated white blood cell count. Start antibiotics.

📝 Note: MVA victim w/ LOC. CT head negative. C-spine cleared. Neuro exam WNL. Discharge home.
🤖 Response: Motor vehicle accident victim with loss of consciousness. Computed tomography head negative. Cervical spine clear. Neurological exam normal. Discharge home.
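For completeness, here is roughly how you'd query the tuned adapter to produce translations like these. The prompt template is a simplified Phi-3-style assumption, not the exact format from my notebook:

```python
def build_prompt(note: str) -> str:
    # Simplified chat-style prompt (an assumption, not the exact template).
    return ("<|user|>\nTranslate this medical note into plain English: "
            f"{note}<|end|>\n<|assistant|>\n")

def translate(note: str, model, tokenizer, max_new_tokens: int = 120) -> str:
    import torch  # third-party
    inputs = tokenizer(build_prompt(note), return_tensors="pt").to(model.device)
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=max_new_tokens,
                                do_sample=False)
    new_tokens = output[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Before calling `translate`, you'd load the base model and attach the saved LoRA adapter with `peft.PeftModel.from_pretrained`.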

Of course, if this were a client-facing project, it would require far more rigorous testing and likely benefit from additional hyperparameter tuning. But for a POC—or even a minimum viable product (MVP)—it’s more than sufficient.

Learn to Fine-Tune LLMs

If you want to learn to fine-tune LLMs, Google has a number of excellent, free machine learning courses. Its Machine Learning Crash Course has a module dedicated to fine-tuning LLMs.

Access My Files

You can access my Colab file here and the data and model here.

Image credit: Jonathan Borba

Written by Annie Cushing · Categorized: AI
