Digital Repositories Committee: 2022-05-02 Meeting notes

Date

Participants

Discussion topics





Annual meeting update

Carmen Mitchell

  • The committee accepted all proposals, and confirmations are going out now. The draft agenda should be completed in the next two weeks.

Working groups restructuring

Nicole Shibata

Now that the DAMS pilot is officially under way, the MWG is tasked with overseeing metadata work for two systems: ScholarWorks and the DAMS. I had a conversation with Melissa recently, and we both agreed that it's probably time to rethink the group's priorities and, more broadly, to consider restructuring both DAWG and MWG to better support the two systems.

This idea was initially brought up at the last DAWG meeting when we were talking about next steps for the group. While DAWG seems to be at a point where it is looking for new projects and direction, MWG needs more support on the DAMS side, and most of its current members are ScholarWorks-focused. It seems to make the most sense to keep the MWG focused on ScholarWorks and maybe even consider a broader charge for the group, similar to DAWG, now that all the campuses are migrated over. Likewise, DAWG can take on the metadata responsibilities for the DAMS, particularly now that we're in the pilot phase. There seemed to be broad interest in this idea among DAWG members. Also, considering Melissa and I were already splitting the co-chair responsibilities based on the two systems, this seems to be a natural progression for the MWG.

Carmen will email COLD to let them know we want to shift the working groups to DAWG and a ScholarWorks/IR working group. She will also email the folks who volunteered for the MWG to ask whether they would be willing to serve on an IR working group. If there is attrition, we can do direct recruitment.
The working groups would then include metadata in their charges.

New committee/Working group members

Carmen Mitchell

Please invite the new members to your June meetings. Even though the new terms start in July, it would be good to have a “transition” meeting where you can vote on the chair and set up new meeting times if the old meeting time isn’t convenient for all the new folks.

Annual report prep

Carmen Mitchell

Could the WG chairs, Dave, and Carmen plan to compile their end-of-year updates by June 10th? This will give us some time to circulate the report and send it to COLD. Previous examples are available.

Information item:

Tesseract OCR

From Mark Bilby

Just read up for the first time on Tesseract 5.0, the open source OCR solution, after seeing it used in recent Internet Archive digitization batches. We’re about to gear up to digitize thousands of retro theses and had considered using ABBYY for this piece of the workflow, but it looks like Tesseract may be the better way to go. Made me wonder whether other CSUs have ever used it, and if not, whether it might be worth exploring and providing some training. Another approach might be for one campus or the CSUCO to spin up a virtual machine once a week or once a month and run a batch task in which all PDFs or image files from an input group of folders in cloud storage are processed and saved to an output folder.
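
For anyone curious what such a batch task might look like, here is a rough sketch in Python. It is only an illustration, not a workflow anyone has tested: it assumes the `tesseract` command-line program is installed on the machine, that the cloud folders are synced or mounted as ordinary local directories, and that the inputs are image files. Tesseract does not read PDFs directly, so PDFs would first need to be converted to images in a separate step. The function names and folder layout are hypothetical.

```python
import subprocess
from pathlib import Path

# Image types Tesseract can read directly. PDFs are not accepted as input,
# so they would need to be rasterized to images in a separate step first.
IMAGE_SUFFIXES = {".tif", ".tiff", ".png", ".jpg", ".jpeg"}


def plan_jobs(input_root, output_root):
    """Pair every image under input_root with a mirrored output base path."""
    input_root, output_root = Path(input_root), Path(output_root)
    jobs = []
    for src in sorted(input_root.rglob("*")):
        if src.is_file() and src.suffix.lower() in IMAGE_SUFFIXES:
            relative = src.relative_to(input_root)
            # Tesseract appends the extension itself, so drop the suffix.
            jobs.append((src, output_root / relative.with_suffix("")))
    return jobs


def build_command(src, out_base, lang="eng"):
    """Build the CLI call; the trailing "pdf" config requests a searchable PDF."""
    return ["tesseract", str(src), str(out_base), "-l", lang, "pdf"]


def run_batch(input_root, output_root, lang="eng"):
    """OCR every planned job, skipping outputs that already exist."""
    done = []
    for src, out_base in plan_jobs(input_root, output_root):
        target = out_base.with_name(out_base.name + ".pdf")
        if target.exists():
            continue  # already processed in an earlier run
        target.parent.mkdir(parents=True, exist_ok=True)
        subprocess.run(build_command(src, out_base, lang), check=True)
        done.append(target)
    return done
```

A weekly or monthly scheduled job could then call something like `run_batch("incoming", "ocr_output")`, where both folder names are placeholders. Because existing outputs are skipped, rerunning the task would only process files added since the last run.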

Update from Mark: Internet Archive also has free OCR and multi-format services for partners:
“We offer free OCR for any texts uploaded to now - just upload the PDFs from your retro theses and our standard derive process will return OCR (via Tesseract) along with our other post-prod outputs. We can get you set up with a collection for the theses & access to our command line tool (if you don't have it already) for batch processing. Happy to connect you with the right folks if you're interested in pursuing!”
Contact Mark if you are interested.

Open Forum feedback?

Dave Walker

Could we use the open forum on Friday for feedback on the work forms? If there is something else that needs to be discussed, please email Dave.

Action items


Decisions