Advanced Data Services

When the problem is harder than installing software.

Talk to Us

We Are Not Just Installers

Most archival hosting providers install the software and hand you a login. We do that — and then we handle what comes next: parsing twenty years of Word documents into a structured ISAD(G) hierarchy, cleaning 40,000 authority records full of encoding artefacts and spelling variants, building semantic search over a transcribed manuscript collection, generating IIIF manifests for a photographic archive with no existing metadata. These are archival data problems. They require someone who understands both the data and the standards it needs to conform to.

If a previous migration attempt stalled because your data was too inconsistent — that is exactly what this service is for. Tell us what happened.

🔧

Complex Legacy Data Extraction

Standard CALM exports and CSV imports are only the beginning. We write custom extraction and transformation scripts for systems that don't export cleanly: nested FileMaker Pro databases, in-house Microsoft Access catalogues, proprietary XML schemas from obsolete systems, and Adlib formats that predate clean Unicode support. We have seen the full spectrum of what decades of local cataloguing practice produces.

🐍

Metadata Engineering at Scale

We use Python and OpenRefine to clean, normalise, and restructure archival metadata at scales that cannot be handled manually. Inconsistent date formats across 50,000 records. Duplicate authority entries across thirty years of different cataloguers. Scope notes with embedded tabular data that should be split into structured fields. Free-text extents that need parsing into numerical values. We map the mess to ISAD(G) fields and import it clean.
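As a rough illustration of what this kind of clean-up involves, here is a minimal Python sketch that normalises a few date layouts and pulls a number and unit out of a free-text extent. The format list, function names, and handling rules are assumptions made for the example; a real collection needs its own rules, and anything unmatched should go to an archivist for review rather than be guessed.

```python
import re
from datetime import datetime

# Hypothetical set of date layouts found in a legacy catalogue; extend per collection.
DATE_FORMATS = ["%d/%m/%Y", "%d %B %Y", "%d %b %Y", "%B %Y", "%Y"]


def normalise_date(raw):
    """Return an ISO 8601 string, or None so the record can be flagged for review."""
    text = raw.strip().rstrip(".")
    for fmt in DATE_FORMATS:
        try:
            parsed = datetime.strptime(text, fmt)
        except ValueError:
            continue
        # Keep only the precision the source actually recorded.
        if fmt == "%Y":
            return f"{parsed.year:04d}"
        if fmt == "%B %Y":
            return f"{parsed.year:04d}-{parsed.month:02d}"
        return parsed.date().isoformat()
    return None


# Matches the first "number + single-word unit" pair, e.g. "3 boxes".
EXTENT_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*([A-Za-z]+)")


def parse_extent(raw):
    """Split a free-text extent into (quantity, unit), or None if nothing matches."""
    match = EXTENT_PATTERN.search(raw)
    if match is None:
        return None
    return float(match.group(1)), match.group(2).lower()
```

Returning None instead of a best guess is deliberate: uncertain values such as "c. 1890s" are kept out of the structured field until someone has looked at them.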

📄

Finding Aid Parsing

Many UK religious, society, and business archives have their entire catalogue locked inside Word documents or PDFs. We parse these programmatically: extracting headings as hierarchy levels, pulling dates and extents into ISAD(G) fields, identifying creators, and building ISAAR(CPF) authority records. The result is a properly structured ISAD(G) catalogue, importable into AtoM, built from what was previously a text document.
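To give a sense of how headings become hierarchy levels, the sketch below takes paragraphs that have already been read out of a document as (style name, text) pairs and nests them into a tree. The style-to-level mapping, the field names, and the input shape are all assumptions for illustration; real finding aids rarely use heading styles this consistently, which is where the archival judgement comes in.

```python
# Assumed mapping from Word heading styles to ISAD(G) levels of description.
LEVELS = {
    "Heading 1": ("fonds", 1),
    "Heading 2": ("series", 2),
    "Heading 3": ("file", 3),
}


def build_hierarchy(paragraphs):
    """Nest (style, text) pairs into a tree of description records."""
    roots = []
    stack = []  # (depth, node) for each currently open ancestor
    for style, text in paragraphs:
        if style not in LEVELS:
            # Body text is attached to the most recent heading.
            if stack:
                stack[-1][1]["scope_and_content"].append(text)
            continue
        level, depth = LEVELS[style]
        node = {
            "level": level,
            "title": text,
            "scope_and_content": [],
            "children": [],
        }
        # Close any open records at the same depth or deeper.
        while stack and stack[-1][0] >= depth:
            stack.pop()
        if stack:
            stack[-1][1]["children"].append(node)
        else:
            roots.append(node)
        stack.append((depth, node))
    return roots
```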

✍️

Handwritten Text Recognition (HTR)

If your collection includes parish registers, estate papers, medieval manuscripts, or any significant volume of handwriting, we run HTR transcription pipelines using eScriptorium — the open-source alternative to Transkribus, with no per-page fees. We train models on your specific scripts and hands, batch-process the collection, and deliver structured, searchable transcription output. The resulting text can feed directly into semantic search, AtoM scope notes, or TEI export — not just a folder of raw text files.

🖼️

Batch IIIF Manifest Generation

Photographic collections, manuscript pages, and digitised print collections all need IIIF manifests — but generating them one by one is not viable at scale. We build batch pipelines that ingest images from whatever source format exists (a folder of TIFFs, a Flickr export, a CONTENTdm collection), extract or generate metadata, and produce standards-compliant IIIF Presentation API manifests ready for Mirador, Universal Viewer, or any other IIIF-compatible viewer. Compound objects, sequences, and ranges handled. Live example: passionistarchives.co.uk →
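For readers who want to see what a generated manifest looks like, here is a minimal Python sketch that builds one from a list of images. It assumes version 3.0 of the IIIF Presentation API, JPEG images whose pixel dimensions are already known, and a hypothetical URL layout under `base_url`; the function name and input shape are illustrative only.

```python
import json


def build_manifest(base_url, label, images):
    """Build a IIIF Presentation 3.0 manifest as a dict.

    images: list of dicts with 'url', 'width' and 'height' (pixels).
    """
    canvases = []
    for index, image in enumerate(images, start=1):
        canvas_id = f"{base_url}/canvas/{index}"
        canvases.append({
            "id": canvas_id,
            "type": "Canvas",
            "height": image["height"],
            "width": image["width"],
            "items": [{
                "id": f"{canvas_id}/page",
                "type": "AnnotationPage",
                "items": [{
                    "id": f"{canvas_id}/annotation",
                    "type": "Annotation",
                    "motivation": "painting",
                    "body": {
                        "id": image["url"],
                        "type": "Image",
                        "format": "image/jpeg",  # assumed image format
                        "height": image["height"],
                        "width": image["width"],
                    },
                    # The image is painted onto its own canvas.
                    "target": canvas_id,
                }],
            }],
        })
    return {
        "@context": "http://iiif.io/api/presentation/3/context.json",
        "id": f"{base_url}/manifest.json",
        "type": "Manifest",
        "label": {"en": [label]},
        "items": canvases,
    }
```

Calling `json.dumps(build_manifest(...), indent=2)` gives the JSON document a viewer would load. Descriptive metadata, ranges, and image service entries are left out of this sketch.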

🔍

Semantic Search for UK Text Collections

Standard keyword search misses the way researchers actually think. Semantic search (meaning-based search using vector embeddings) finds relevant passages by concept, returning results for "references to the bishop's involvement in land disputes" even when none of those exact words appear in the text. We build and host semantic search over transcribed manuscript collections, obituary corpora, diaries, correspondence series, and other large UK text archives. Results use verbatim span extraction: researchers get a citable passage from the original source, not a hallucinated summary. Request a demonstration →
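The retrieval step can be sketched in a few lines of Python. This example assumes every passage already has an embedding vector from some model (not shown, and not specified here), ranks passages by cosine similarity to the query vector, and returns the stored text unchanged alongside its source reference. The function names and record fields are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vector, passages, top_k=3):
    """Rank passages by similarity and return (source, verbatim text) pairs.

    passages: list of dicts with 'source', 'text' and 'vector' keys,
    where 'vector' is an embedding computed in advance.
    """
    ranked = sorted(
        passages,
        key=lambda p: cosine(query_vector, p["vector"]),
        reverse=True,
    )
    # The text is returned exactly as stored: nothing is generated or summarised.
    return [(p["source"], p["text"]) for p in ranked[:top_k]]
```

Because the function only ever returns text that was indexed, every result can be cited back to the document it came from.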

📸

AI-Assisted Photographic Processing

Large photographic collections, particularly those accumulated over institutional lifetimes, frequently contain thousands of images with no systematic metadata. We have built custom computer vision pipelines and vector similarity indexing to help identify individuals across large unidentified collections, linking photographic records to existing authority files where matches are found. This is an investigative tool that supports archivist judgement rather than replacing it. Available as a scoped engagement: contact us to discuss your collection.

What Makes This Different

These services are built on professional archival knowledge, not generic data engineering. When we parse a Word finding aid, we know what a fonds-level description looks like versus a series-level one — and what to do when the original cataloguer mixed levels inconsistently. When we clean authority records, we know the difference between a personal name variant and a genuine duplicate. When we build semantic search, we index against the actual text archivists and researchers care about, with verbatim span extraction — exact, citable passages from the original document — so results are always traceable back to their source.

We do not hand you a spreadsheet of suggestions. We produce import-ready data, tested against a staging environment, verified against the source system before cutover.

Who This Is For

Archives with complex legacy systems — bespoke databases, proprietary formats, decades of bespoke cataloguing that doesn't map cleanly to any standard export format
Collections with large photographic holdings — unidentified or under-described images that need systematic processing before they can be published online
Collections with significant manuscript or handwritten holdings — parish registers, estate papers, medieval documents, 20th-century correspondence — needing HTR transcription before they can be catalogued or searched
Research portals and digital humanities projects — needing semantic search over transcribed manuscripts, correspondence, or obituary collections
Institutions with large-scale digitisation programmes — needing batch IIIF manifest generation and metadata enrichment for hundreds or thousands of objects
Anyone whose data is too messy for a standard migration — if a previous attempt at migration stalled because the data was too inconsistent, come to us

Related Services

Migration Guide
Get Your Archive Online
IIIF Solutions
eScriptorium / HTR

Tell Us About Your Data Problem

We will give you an honest assessment of what's involved and whether we can help.

📧 info@archiveshosting.co.uk
