The Challenge of Scanned Medical Records

Medical records in personal injury cases rarely arrive as clean, searchable digital documents. In practice, bundles typically consist of a mixture of digitally produced letters and reports, scanned handwritten GP notes, photocopied hospital correspondence, faxed referral letters, and older records produced on typewriters or early word processors. Many of these documents have been scanned multiple times, resulting in degraded image quality, skewed pages, and inconsistent orientations.

AI analysis depends entirely on the quality of the text it reads. A model that receives accurate, complete text from a medical bundle will produce a thorough and reliable review. A model that receives partial or corrupted text — because the OCR stage failed to correctly interpret a page — will produce a review with gaps. OCR quality is not a peripheral technical concern; it is the foundation on which everything else rests.

Reawoken treats OCR as a first-class part of the processing pipeline. Rather than running a single OCR engine over every page, it decides page by page whether OCR is needed and applies enterprise-grade OCR only to the pages that require it.

Smart PDF Analysis: OCR Only Where Needed

Not every page in a medical bundle requires OCR. Many pages — particularly modern clinic letters, GP summary reports, and electronically generated discharge summaries — already contain searchable text embedded in the PDF. Running full OCR on these pages would add processing time without improving accuracy; in some cases, it would actually introduce errors by substituting machine-read text for the higher-quality embedded version.

Reawoken analyses each page of an uploaded PDF before deciding whether to apply OCR. Pages that already contain more than 500 characters of embedded text are passed directly to the AI analysis stage using their existing text content. Only pages at or below that threshold, such as scanned images, handwritten notes, and otherwise blank pages bearing only a stamp, are sent through the OCR pipeline.
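The page-level decision described above can be sketched in a few lines of Python. This is a minimal illustration, not Reawoken's actual implementation: the names `PageDecision` and `route_pages` are hypothetical, the embedded text of each page is assumed to have been extracted already by a PDF library, and counting characters after trimming surrounding whitespace is an assumption. Only the 500-character threshold comes from the text.

```python
from dataclasses import dataclass

# Threshold taken from the article; everything else here is illustrative.
EMBEDDED_TEXT_THRESHOLD = 500


@dataclass
class PageDecision:
    page_number: int      # 1-based position in the uploaded PDF
    embedded_chars: int   # characters of embedded text found on the page
    needs_ocr: bool       # True when the page should go to the OCR pipeline


def route_pages(embedded_text_by_page: list[str]) -> list[PageDecision]:
    """Decide, page by page, whether to reuse embedded text or run OCR."""
    decisions = []
    for number, text in enumerate(embedded_text_by_page, start=1):
        # Assumption: leading and trailing whitespace does not count as content.
        chars = len(text.strip())
        # More than 500 characters: trust the embedded text. Otherwise: OCR.
        needs_ocr = chars <= EMBEDDED_TEXT_THRESHOLD
        decisions.append(PageDecision(number, chars, needs_ocr))
    return decisions


if __name__ == "__main__":
    sample = ["Dear Dr Smith, " * 40, "", "RECEIVED"]
    for decision in route_pages(sample):
        print(decision)
```

Under these assumptions, a typed clinic letter with several hundred characters of embedded text bypasses OCR, while an image-only scan or a page whose only embedded text is a date stamp is routed to it.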

This selective approach reduces processing time for large bundles while ensuring that OCR resources are directed precisely where they are needed. A 400-page bundle in which 250 pages are digitally produced letters needs OCR on only the remaining 150 or so pages, so it processes substantially faster than one where every page is sent through OCR, without any reduction in the quality of the output.

Dual OCR Engines for Maximum Accuracy

Reawoken integrates two enterprise-grade OCR engines: AWS Textract and Azure Document Intelligence. Both are specifically designed for document processing at scale and have been trained on a wide variety of document types, including handwritten text, tabular data, and mixed-format layouts.

AWS Textract

Amazon's document analysis service excels at structured document types, form fields, and tabular layouts. It is particularly effective with GP summaries, structured hospital letters, and documents with clear formatting.

High accuracy on structured documents
Strong table and form detection
Scalable processing for large bundles

Azure Document Intelligence

Microsoft's document processing service provides strong performance on unstructured and handwritten content, including older clinical records and mixed-format pages with stamps, annotations, and degraded quality.

Robust handwriting recognition
Handles degraded scan quality
Effective on unstructured clinical notes

Dual-engine support has two benefits. Reawoken can be configured to use whichever engine best suits the types of records a particular firm typically receives, and processing can continue if one service experiences downtime or degraded performance.
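The resilience described above amounts to trying a preferred engine first and falling back to the other if it fails. The sketch below shows that pattern under stated assumptions: each engine is modelled as a plain Python callable rather than a real AWS or Azure client, and `ocr_with_fallback` and `OcrUnavailableError` are hypothetical names introduced for illustration.

```python
from typing import Callable, Sequence

# An OCR engine is modelled as any callable that takes page image bytes
# and returns extracted text. Real cloud clients would be wrapped to fit.
OcrEngine = Callable[[bytes], str]


class OcrUnavailableError(RuntimeError):
    """Raised when every configured engine has failed for a page."""


def ocr_with_fallback(
    image: bytes, engines: Sequence[tuple[str, OcrEngine]]
) -> tuple[str, str]:
    """Try engines in preference order; return (engine_name, text)."""
    errors = []
    for name, engine in engines:
        try:
            return name, engine(image)
        except Exception as exc:  # a real client would catch narrower errors
            errors.append(f"{name}: {exc}")
    raise OcrUnavailableError("; ".join(errors))


if __name__ == "__main__":
    def unavailable(_image: bytes) -> str:
        raise TimeoutError("service unavailable")

    def working(_image: bytes) -> str:
        return "BP 120/80, review in 2/52"

    # The per-firm preference is expressed simply as the order of the list.
    print(ocr_with_fallback(b"...", [("primary", unavailable), ("secondary", working)]))
```

Expressing the firm's preference as list order keeps configuration and failover in one place: swapping the two entries changes the default engine without touching the fallback logic.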

Handling Difficult Medical Documents

Medical records present a uniquely challenging set of document types. GP handwritten notes — particularly those dating from before the widespread adoption of electronic clinical systems — are often written quickly, with abbreviations, non-standard terminology, and layouts that vary significantly between practices and individual clinicians.

Reawoken's OCR pipeline handles this variability by processing each page as an independent document image rather than assuming a consistent layout across the bundle. Pages with stamps, handwritten annotations in margins, mixed typewritten and handwritten content, and photocopied documents with grey backgrounds are all handled without manual clean-up or intervention beforehand.

After OCR, the extracted text is passed through a post-processing stage that corrects common OCR artefacts, normalises spacing and line breaks, and prepares the text for ingestion by the AI analysis pipeline. This ensures that the AI receives clean input even when the underlying document quality was poor.
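A post-processing stage of this kind might look like the following sketch. The specific rules are assumptions chosen for illustration (the article does not list the corrections Reawoken applies), and `normalise_ocr_text` is a hypothetical name. It unifies line endings, collapses runs of spaces, rejoins lowercase words split by a hyphen at a line break, and limits consecutive blank lines.

```python
import re


def normalise_ocr_text(raw: str) -> str:
    """Clean up spacing and line breaks in raw OCR output (illustrative rules)."""
    # Unify Windows and old Mac line endings.
    text = raw.replace("\r\n", "\n").replace("\r", "\n")
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Remove spaces left hanging on either side of a line break.
    text = re.sub(r" ?\n ?", "\n", text)
    # Rejoin lowercase words hyphenated across a line break ("physio-\ntherapy").
    text = re.sub(r"(?<=[a-z])-\n(?=[a-z])", "", text)
    # Keep at most one blank line between blocks of text.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


if __name__ == "__main__":
    raw = "Pt  c/o   back\tpain\r\n\r\n\r\n\r\nphysio- \ntherapy advised  "
    print(normalise_ocr_text(raw))
```

The hyphen rule is deliberately restricted to lowercase letters so that numeric ranges split across lines are left alone. It can still merge genuinely hyphenated compounds such as "follow-up", which is one reason a production pipeline would likely be more conservative.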

Lloyd George Records and Handwritten GP Notes

Many personal injury bundles — particularly for older claimants or those with lengthy medical histories — still contain records originating from Lloyd George envelopes. Named after the former Prime Minister whose legislation laid the foundations for national healthcare, Lloyd George envelopes are small card wallets (roughly 130 mm × 180 mm) that were used by GP practices across the NHS from 1948 onwards to store a patient's entire paper medical record. Each envelope holds handwritten consultation notes, specialist correspondence, prescription records, vaccination details, and any other clinical documentation accumulated over the patient's lifetime.

Although the NHS stopped issuing new Lloyd George envelopes in 2021 and many practices have since digitised their paper records, the scanned versions of these documents frequently appear in medical record bundles disclosed in litigation. The scanned quality is often poor: the original notes were small, densely written, and in many cases decades old. Faded ink, creased paper, stamps overlapping handwriting, and annotations squeezed into margins are all common characteristics of Lloyd George record scans.

Reawoken's OCR pipeline is designed to handle these documents. Each page is processed as an independent image regardless of its size, orientation, or condition, and the dual OCR engines are capable of extracting text from degraded scans that would defeat simpler OCR tools. For solicitors working with older claimants whose GP histories stretch back several decades, this means that the pre-accident medical picture captured in Lloyd George notes is not lost to poor scan quality.

Handwritten Notes

Handwritten GP notes are one of the most common concerns raised by solicitors considering AI-powered medical record review. The question is understandable: if a human reviewer struggles to read a GP's handwriting, how can an OCR engine be expected to do better?

The answer lies in the capabilities of modern enterprise OCR technology. Both AWS Textract and Azure Document Intelligence have been trained on vast datasets of handwritten documents and are specifically designed to interpret cursive and semi-cursive handwriting, including the abbreviated, rapidly written style typical of clinical notes. They do not require neat, printed text to function — they are built for exactly the kind of imperfect handwriting found in real-world medical records.

No OCR engine achieves 100% accuracy on every handwritten page — particularly where the original handwriting is genuinely illegible even to a human reader. However, the combination of enterprise-grade OCR with Reawoken's post-processing normalisation produces consistently usable text from the vast majority of handwritten clinical notes. Where a human reviewer would spend minutes deciphering a single page of GP handwriting, the OCR pipeline processes it in seconds and passes the extracted text directly to the AI analysis stage.

Built for Large Bundles

Medical record bundles in personal injury cases range from a handful of pages to several thousand. Noise-induced hearing loss (NIHL) cases in particular frequently involve GP records spanning 30 to 40 years, which can run to between 800 and 1,000 pages, or more. Employers' liability cases may include medical records from multiple treating hospitals, physiotherapy centres, and occupational health providers.

Reawoken supports individual file uploads of up to 100 MB and bundles of up to 1,250 pages. This covers the great majority of bundles encountered in practice without requiring solicitors to split files manually or process records in batches. Because OCR is applied only to pages that require it, processing time grows mainly with the number of scanned pages rather than with the total page count.
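The stated limits can be expressed as a simple pre-upload check. This is a sketch under assumptions, not a description of Reawoken's validation logic: `check_bundle` is a hypothetical name, "100 MB" is read as 100 × 1024 × 1024 bytes, and the 1,250-page limit is assumed to apply to the total across all files in a bundle.

```python
# Limits quoted in the article; the byte interpretation is an assumption.
MAX_FILE_BYTES = 100 * 1024 * 1024
MAX_BUNDLE_PAGES = 1250


def check_bundle(files: list[tuple[int, int]]) -> list[str]:
    """Return a list of problems for a bundle of (size_bytes, page_count) files.

    An empty list means the bundle is within both limits.
    """
    problems = []
    for index, (size_bytes, _pages) in enumerate(files, start=1):
        if size_bytes > MAX_FILE_BYTES:
            problems.append(f"file {index} exceeds the 100 MB upload limit")
    total_pages = sum(pages for _size, pages in files)
    if total_pages > MAX_BUNDLE_PAGES:
        problems.append(
            f"bundle has {total_pages} pages; the limit is {MAX_BUNDLE_PAGES}"
        )
    return problems


if __name__ == "__main__":
    # Two files: a 40 MB GP record of 900 pages and a 12 MB hospital record.
    print(check_bundle([(40 * 1024 * 1024, 900), (12 * 1024 * 1024, 200)]))
```

Returning every problem at once, rather than stopping at the first, lets an upload form tell the user exactly which files to split before they try again.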

Files are processed securely over encrypted connections and stored in AWS S3 with access controls restricting visibility to the firm that uploaded them. OCR-processed text is retained only as long as necessary for the analysis and export workflow, in line with GDPR data minimisation principles.

The Foundation for Reliable AI Analysis

The quality of an AI medical record review is inseparable from the quality of the text it receives. An AI that reads accurate, complete text from a 500-page bundle can identify every relevant medical entry, produce a thorough chronology, and flag every pre-existing condition. An AI reading corrupted or incomplete OCR output cannot — and the gaps in its analysis may not be visible to the reviewer.

Reawoken's investment in OCR quality reflects this reality. Smart page-level detection, dual enterprise OCR engines, and post-processing normalisation combine to ensure that the AI analysis stage receives the best possible text input from every uploaded bundle — regardless of the age, format, or condition of the original documents.
