Meeting minutes 2020-04-22
22 April 2020 3:00 p.m. UTC (time zone calculator)
Attendees
- Artefactual: Sara Allain, Sarah Romkey, Justin Simpson, Jesús Garcia Crespo
- Bodleian: Susan Thomas, Matthew Neely, James Mooney, Edith Halvarsson
- International Institute of Social History: Eric de Ruijter, Robert Gillesse
- Max Communications: Geoff Blissitt, Tim Schofield
- Picturae: Wim van Dongen
- Wellcome Collection: Tom Scott
Regrets: Wendy Maule, Jonathan Tweed
Agenda
3:00 UTC - Meeting starts (Sarah Romkey)
3:05 UTC - Housekeeping (Sara Allain)
3:10 UTC - Archivematica core definition (Sarah, Justin, Jesús)
3:40 UTC - Bodleian metadata development work (Matthew Neely)
4:10 UTC - Any other business (Sarah Romkey)
4:25 UTC - Meeting wrap-up (Sarah Romkey)
Notes
Housekeeping
- Led by: Sara Allain
- Meeting minutes from last meeting are in Github (at psp.archivematica.org)
- Will send out internal version of these minutes for review, we can remove anything private before posting to PSP site
Archivematica core definition
- Led by: Sarah Romkey
[SR]
- Promise from last meeting is that we’d have something for this meeting; struck a working group at Artefactual (JHS, JGC, SR)
- Decided that we needed to present an outline to confirm that we were doing something that made sense and was useful and we were on the right track
- Some things that we did:
- Looked at other projects and how they document their core definition
- Thought about what the most useful things would be for the Archivematica community and the project
- Started a document
- One question that came up: should it be status quo or aspirational? Design document or historical document?
- Main sections:
- Data model and definitions section - glossary of what things mean in the Archivematica context, what standards are we committed to using both in the past and future (e.g. AIP, PREMIS and METS, etc.)
- Have been thinking about event-based modeling and development, wanted to include for future
- How to deploy and what the dependencies are
- Reporting - at a minimum, access to data; at a maximum, what it looks like inside Archivematica itself
- API - status, endpoints, preservation planning, events
- Inputs and outputs - what should you be able to put in and get out?
- OAIS - AM claims to be an OAIS-based system, what does this mean in a practical sense?
[JHS]
- Recognizing the distinction between AM as software vs as a preservation system - e.g. it’s possible to install AM without Elasticsearch, which is fine for the software, but in a classically-defined preservation system you need some way to index and search for material
- Jesús suggested that maybe we should be looking at the minimum definition of Archivematica - what is the smallest thing you could have that would be called Archivematica? The answer is likely to be the part that produces AIPs
- This smallest part can be combined with other aspects to build out a preservation system
[JGC]
- Started looking into the idea of a minimal definition
- One of AM’s main problems is complexity of deployment, need to deploy multiple databases/dependencies that are really hard to manage
- Did an experiment to try to deploy Archivematica with the minimum amount of resources - e.g. moving Elasticsearch, Storage Service to streamline things like package creation
- Experiment has been really useful to identify parts of the codebase that are really hard to maintain
- There are a number of features that aren’t well-supported
[JHS]
- Important to consider the end goals of AM - functions as more than an AIP-making tool, e.g. some users use it for preservation planning
- Functionality varies a lot by institution depending on institutional policies
- Historically Archivematica has assumed that there is not any other software involved in the system, but there’s a big difference between a system that is monolithic and one that can integrate with any system
[SR]
- Had an “aha” moment this week talking about integrations - we have long described AM as agnostic to storage/access systems but it’s not truly agnostic - we’ve just been willing to build integrations with any other system
[EDR]
- Agrees with Sarah’s point - AM should integrate easily with other platforms outside of AM itself to make it easy
[WVD]
- Need to be aware that AM is in the midst of a workflow chain
[ST]
- Will the document lay out the context? Does core mean supported by the community, or is this just a means of describing things in a modular fashion?
[SR]
- Will add a context section
- Was thinking about it as core is what we commit to maintaining across versions; if an institution is relying on something that is outside the core, there might be an extra maintenance burden on the user
[JHS]
- We don’t have a specification for how Archivematica produces metadata, for example; looking at e-ARK project for examples
- Helpful for PSP institutions to think about what they need to see in this document; in a sense this document is an assertion by the community
- E.g. at Artefactual we have a lot of clients who use the search in AM as their only interface to the AIP store, so there needs to be some search or AM is not usable by them
[GB]
- I think a good starting point for this type of exercise is to ask “how is it useful”?
- And part of that is to ask to what extent does the core Archivematica align with a definition of a core digital preservation system - e.g. Digital Preservation Coalition work
[RG]
- “Workflow” is missing from the document, Archivematica functions as a workflow tool for IISH but also is part of a bigger workflow
[JHS]
- Two kinds of workflows - the one that produces AIPs and DIPs, and the overarching workflow that integrates with other systems
- Need to think about them separately - what are the steps to make packages? This is definitely a core thing.
- Then integrations - what are the common workflows across institutions
[RG]
- One of the reasons that IISH liked AM is ability to wrap in pre-ingest work
[SR]
- We’ve used events somewhat analogously with workflows
[JGC]
- Events are a good way to model workflows
- We’ve experimented with better workflow automation through Enduro, which solves the problem of workflow orchestration
- Workflow is a bigger umbrella for workflows across multiple systems
[JHS]
- When thinking about core, we might consider what can be taken away or changed without breaking the ability to make AIPs - e.g. can take away or change search, but can’t change preservation planning or else AIP creation will break
- Maybe need to define everything that AM needs to do/know to make an AIP and nothing more than that
[SR]
- Next steps - add context and introduction, information about workflows
- Will be working on filling the sections of the document in over the next few months
[GB]
- In terms of the status quo or aspirational question. Should we do both?
[SR]
- We could do that, mark certain sections as status quo or aspirational
[JHS]
- Focus on status quo
[WVD]
- Would like to add, in outputs - reporting of technical metadata
[JHS]
- Aggregate information isn’t very robust in Archivematica, but there are a lot of ways that you could do this
[TSchofield]
- Hard to organize work without being able to interrogate the system
[WVD]
- Have been providing use cases to Artefactual that deal with reporting, will make available in GDrive
- Some currently Picturae development work will hopefully address some of this, e.g. being able to download info from Archival Storage tab
[SR]
- Maybe we can discuss use cases for reporting in next meeting?
[JGC]
- We currently think of reporting as a web interface, and we think about the outputs that are currently possible (e.g. database information about actions that have happened) - we could be outputting all information from all of the digital preservation steps, which would give us a much more flexible understanding of what could be reported in Archivematica. E.g. could include user interactions
[JHS]
- Ensuring that the data model is clear and that the specification for the data model is clear makes it easier to extract data
Bodleian metadata development work
Led by: Matthew Neely
- Proof-of-concept project in 2019 found two main gaps - one of them was metadata
- Identified two problem areas - metadata verbosity, and migrating legacy metadata (latter is Bodleian-specific)
- Bodleian working with very large AIPs and will continue to generate these in future, 10,000+ files
- Can’t even open some of the METS files
- Human understanding is difficult because the METS file is so big and nonlinear
- Small AIP - 8,825 lines of XML
- Lots of repetition - dates, formats, etc.
- Had a workshop with Kelly and Justin in May 2019, looked at developing an RDF Turtle version of the METS
[JHS]
- This goes back to the Jisc project, analysis in that project described how AM currently writes preservation events at the level of the file. PREMIS 3 would allow us to write preservation events at the level of the intellectual entity, e.g. instead of 30,000 virus-scanning events, just one for virus scanning over the whole transfer
- Have done some work since then to define steps needed to continue this work - need to prove out this work. There are other tools and systems that rely on the current structure of the METS. Need to develop a solution that is backwards compatible.
- Bodleian is proposing a second option, not a replacement
- Currently working to analyse the current AM METS file to pull out things that they think are critical, what can be aggregated, what metadata is not essential?
- Over April/May Matthew, Susan, and Edith will be working on developing new metadata model
- Hoping to have a developer in-house work on implementing the new model approx. Sept/Oct
- Keen to work with other PSP members to align any related work
[WVD]
- Is this mainly about technical metadata, or about content metadata (e.g. descriptive?)
- Content metadata still provides for the possibility that the METS file will explode (e.g. if there is extensive descriptive metadata for each item)
- Currently working with City Archives of Amsterdam on complete renewal of their infrastructure, have made a bold decision that Picturae will offer them a new collection management system based on RiC
[RG]
- Verbosity is mostly caused by technical metadata
[JHS]
- Two ways that METS files explode - 1) repetition of preservation events, but also 2) repetition of technical metadata (e.g. near-identical FITS output repeated 30,000 times)
- can be solved by using aggregates as is possible with PREMIS 3
- is more related to Matthew’s point about what metadata is valuable
[All]
- WVD and TSchofield both would like to get involved with development later in the year
[SR]
- This is a priority for Artefactual as well
[TScott]
- Interested in building a metadata specification where there is no upper limit on the amount of material information that can be written into the metadata
- Wellcome also interested in contributing materially to this effort
[JHS]
- There are several ways that upper limits are reached in AM - working on ways to scale up to support millions of files
[SR]
- Aligns with archival principles - just because we can save everything, doesn’t mean that we have to (or should); this is a positive step
Any other business
Roadmap wishlist consolidation
[SR]
- Spreadsheet here
- Mapped roadmaps from different institutions to six main themes - Config/Integrations, Knowledge building, Open source ecosystem, Preservation actions, Architectural requirements, Automation
- These themes are based on previous presentations that Sarah has done about Archivematica
- Automation is one that is on Artefactual’s priority list but didn’t appear on anyone else’s list, mainly driven by Enduro
- Config/Integrations takes up a big chunk
- Open source ecosystem takes up a small chunk, but it is strongly coupled with architectural requirements - if architectural changes aren’t made, AM becomes non-viable
[RG]
- Very helpful to see all these things put together
- Is there weight put to things that have multiple institutions behind them?
[SR]
- Yes - as product owner, definitely informs how these things should be prioritized
[TScott]
- Some of the items are very similar, would be interested to see if this could be written out as a product roadmap
[JHS]
- Could think about these as key results - e.g. reducing metadata verbosity is a key result of improving scalability
[WVD]
- Have taken Picturae’s roadmap items and placed them in the timeline of Archivematica releases (Wim will share)
- Picturae and Artefactual have an understanding that most of their main priorities will be developed to be included in 1.12
- Roadmap was generated via survey to clients, mixed with the priorities for the implementation of Archivematica for the City Archives of Amsterdam and an assessment of feasibility of these items by the Artefactual team.
PAR core project
[JHS]
- OPF webinar next week - is currently full but members can watch recording
Meeting wrap-up
- Next meeting - July 15, 2020 at 3:00 p.m. UTC
- Agenda items for next meeting:
- Use cases for reporting