Monday, March 23, 2015

UW Data Librarians to Present at ACRL

UW data librarians will present at ACRL 2015 in both a panel and a poster session, both on the topic of research data management instruction.

At poster session 2 (Thursday, 3/26, 2-3pm in the Convention Center Exhibit Hall), Mahria Lebow and Jenny Muilenburg will present results from their data management-focused session at 2014's Science Boot Camp West. "Using Active Learning Techniques to Engage Academic Librarians in Research Data Management" will illustrate the techniques they used to engage librarians in an intermediate, 200-level research data management workshop designed to introduce attendees to RDM concepts in a hands-on way. Live polling and group work were used to generate questions, conversations and learning about various RDM topics.

The poll questions were a great way to both engage attendees and spark conversation: audience members could respond anonymously while seeing how others in the audience were responding. Workshop attendees gave quite positive feedback on the techniques used in the session, and the polling segment in particular was effective. The poll questions are available online.

On Friday morning (3/27, 8:30-9:30am, Room A105 in the Convention Center), Jenny Muilenburg, Amanda Whitmire and Heather Coates will present on a panel titled "Promoting Sustainable Research Practices Through Effective Data Management Curricula." This session will detail how each librarian developed a strategy for teaching research data management in different contexts. Each will address how they created their content, how they assessed its effectiveness, and their plans for future directions.

And in case you missed it in a previous post, a full list of data management planning programs at ACRL 2015 is also available.

Wednesday, March 18, 2015

Today's Data, Tomorrow's Discoveries: NSF's OSTP response released today

The NSF OSTP response came out today. Here are a few choice tidbits from a quick reading:

"All data resulting from the research funded by the award, whether or not the data support a publication, should be deposited at the appropriate repository as explained in the DMP. Metadata associated with the data should conform to community standards and the requirements of the host repository. At a minimum, data elements should include acknowledgement of NSF support as well as the award number and appropriate attribution." pg 7

"NSF investigators typically have multiple funding sources. Since a given item may be based on funding from more than one agency, NSF expects to allow submissions of articles and papers to public access repositories operated by other Federal agencies that meet the standards of the OSTP February 22, 2013, memorandum and for which the investigator can provide a persistent identifier as an element in annual or final reports." pg 13

"In collaboration with other Federal agencies and interested parties, NSF will develop criteria for eligible repositories, based on the criteria set forth in the OSTP memorandum, and will provide appropriate guidance for awardees and investigators on the website.
NSF may initiate these discussions as early as FY 2016." pg 14.

"Rarely does NSF expect that retention of all data that are streamed from an instrument or created in the course of an experiment or survey will be required." pg. 15

"Over the next three years, NSF will consult with the community and with other Federal agencies and facilitate the establishment of standards for metadata and repository systems." pg 16

"NSF is aware that individual publishers and library systems are experimenting with new approaches to presenting information, linking publications to data, and providing pointers to repository systems. NSF proposes to foster these developments and their use by ensuring consistent and predictable access to the underlying information, thus providing a platform for creativity and innovation." pg 18.

There's much more in the full text, which deserves a read if you have time! NSF's Executive Summary of the plan is only two pages long.

Tuesday, March 10, 2015

ACRL 2015 Research Data Management Programming

ACRL 2015 is coming up fast, and it's never too early to plan out your conference schedule. While research data management is not a heavy focus of ACRL (as compared to, say, Teaching & Learning), there are still several panels, poster sessions, and roundtable discussions on RDM and related issues, as well as a full-day preconference on setting up data management services. Unfortunately, of the four panel sessions on RDM, two are concurrent, but there is definitely enough to keep you busy.

Items here fall under several topical categories, including Scholarly Communication, Teaching & Learning, Assessment, Technology, and others. We tried to capture all data-management related items here, but if you notice something missing, please let us know in the comments.

The full list is available online.

Monday, February 23, 2015

Research Data Management & Physics/Astronomy Librarian Office Hours

The eScience Institute and the UW Libraries are pleased to announce Research Data Management & Physics/Astronomy Librarian Office Hours in the WRF Data Science Studio.

Physics/Astronomy Librarian Hours: 1-3pm Mondays and 10am-12pm Thursdays
Research Data Management Hours: 11am-1pm Tuesdays and 1-3pm Thursdays

Location: WRF Data Science Studio, 6th floor Physics/Astronomy Tower (map)

During each two-hour slot, librarians will be on hand to provide support and guidance in their respective areas of expertise. This includes support for finding and accessing data, data management planning, data organization, reuse of data, data sharing and storage, data citation, instruction, literature review, publications, citation management tools, physics / astronomy / mathematics research, and more.

Some representative questions we have helped with in the past:
  • The funding agency for my grant requires me to share my data.  What are my options?
  • Can you help me prepare a data management plan for a grant proposal?
  • Are there standards in my field I should be using to describe my data?
  • I’d like to get a DOI for my dataset to include in a journal publication.  Can you help?
  • What can I do to keep track of my HEP citations? I need to keep projects separated.
  • I’m looking for a cosmology paper presented at a conference last year. Does the library have it?
  • How can I access Journal of Physics G: Nuclear and Particle Physics from home?

Tuesday, February 3, 2015

Data Librarianship Educational Resources

Last year I had the opportunity to take several online training courses related to data librarianship and data science, several of which are being repeated this year or are ongoing. For those looking for beginner-level information, these resources can be very helpful in understanding what data management is, how the library can and should be involved, and what it means to be a data librarian (a difficult-to-define term at best). I've also included a few non-course resources that may be of interest. If you have additional resources you'd like to see on this list, let me know in the comments.

So, to kick off 2015 with some educational resources, here's what's covered below:

  • Research Data Management, Library Juice Academy
  • What You Need to Know About Writing Data Management Plans, ACRL
  • Essentials 4 Data Support, Research Data Netherlands
  • The Data Scientist's Toolbox, Coursera
  • Data Information Literacy, book by Carlson and Johnston
  • The Mendeley group Data Management for Librarians

Class: Research Data Management
Source: Library Juice Academy
Instructor(s): Jillian Wallis, UCLA
Format: scheduled online class
When: March 2-27, 2015
Cost: $175

Taken from the course description, the purpose of this class is to "explore the processes of data production and data management, and the role of LIS professional and institutions in supporting data producers." The class is geared toward academic librarians, but is open to anyone. It covers the following topics:
  • The role and lifecycle of research data
  • Stakeholders and stakes in data management
  • Data sharing and data reuse
  • Data selection and appraisal
  • Repositories and registries
  • Data management standards
  • Tools for writing funder-required data management plans
  • The role of institutions and institutional libraries
Participants read up on current policy and research, and prepare a DMP, data policy, or something similar as a final project. There are a lot of readings; although the Library Juice website says there are approximately 15 hours of work for the four-week course, the instructor's introductory email said to expect each week's work to take at least 8 hours, raising the expected workload from 15 to 32 hours. I found that I was indeed spending 8-10 hours a week to complete the readings and assignments and stay on top of the course forums. It's possible they've lightened the reading load for this year, but be prepared.

The 2014 technology was a bit buggy: sometimes readings were popups, sometimes downloads, and sometimes you were taken to a new page. Technical support was iffy -- when I asked for assistance locating two PDFs that were referenced but not linked, I was told I should be able to find them online. A class geared toward working professionals should have all the readings immediately available.

The class gives a strong background in technical information about workflows and the data lifecycle and all its variations, much of which is approached from the academic side (the instructor has a PhD in information science and teaches in the Information Department at UCLA) rather than that of a practicing researcher. Some of this may be old news to a practicing data librarian, or the theoretical underpinnings of data management may be of lesser importance to a librarian trying to develop DMP consultations for researchers, but the background is helpful for understanding current policy and practice at various funding agencies and archives. And working on the final project with peers and the instructor available for help is very useful to someone new to DMPs.

Class: What You Need to Know About Writing Data Management Plans
Source: ACRL
Instructor(s): Dee Ann Allison, Professor, University of Nebraska-Lincoln; Kiyomi Deards, Assistant Professor, University of Nebraska-Lincoln
Format: scheduled online class
When: April 27 - May 15, 2015
Cost: varies, $60 for a student, up to $195 for non-members 

This course is focused specifically on DMPs, with a little background on data management concepts in general. Learning outcomes from ACRL: 
  • List specific data depository resources in order to formulate recommendations for researchers to securely deposit and share their data.
  • Learn about how different funding agencies, and departments within those agencies, have different requirements for data management plans in order to determine how to effectively advise each researcher according to the requirements for their specific plan.
  • Analyze sample data management plans in order to develop an understanding of what constitutes a thorough data management plan.

Topics covered include data and metadata definitions, open data formats, dark archives, repositories, long-term data preservation, and sharing strategies. The course forums for this class were active, and strongly relevant to the weekly readings and assignments. The final project (for my group) was to develop a DMP for a project one of us had been working on or with, and it was very useful to be able to see a real-life example, rather than a case study. Sample DMPs from various disciplines were also evaluated, giving some good examples of variety across fields.

This class is much more aligned with the needs of practicing librarians who need education on what a DMP is and how to construct one. Most in my cohort were other academic librarians with varying levels of experience; this was helpful when we were put in groups for our final project, as each student brought different skills to the table, and we could all benefit from each other's expertise.

There were again some bugs: lots of typos throughout materials, PDFs that opened but disrupted the navigation of the class, a few problems the first week with assignments that couldn't be uploaded. As 2014 was (I believe) the first year this particular course was offered, I would hope that some of these issues have been worked out.

The final group chat for the course was a good place for last thoughts, as well as for shared resources either discovered during the class, or information people use in their own work. This final chat was shared out via email to students, which was great for those who couldn't attend the last virtual class meeting.

Class: Essentials 4 Data Support
Source: Research Data Netherlands
Format: self-paced online class
When: anytime/ongoing
Cost: three levels: free for online class, free with registration for class + forums, $ for in-person workshops (if you're close to Delft)

This class is perfect for those who need to know more about supporting people who work with research data, but don't necessarily need or want a class with readings and homework (which, btw, can sometimes be necessary to make yourself do something!). It's particularly geared toward data librarians, IT staff, and researchers -- anyone with responsibility for data management. A list of competencies the course is meant to address is available online, but in general, the class was developed to teach "the basic knowledge and skills (essentials) to enable a data supporter to take the first steps toward supporting researchers in storing, managing, archiving and sharing their research data."

For practicing librarians who need to get up to speed on data management, this is the place to go. It assumes a common background knowledge, yet presents information on data management in a simple and direct way, with additional resources and readings if needed. No special software is needed, it's a very simple and well-designed website, and it's easy to dip into the topics you need to know, leaving the rest for later.

There are six sections to the course, each of which provides an overview and objectives, the content of the section, and additional resources and readings. If you provide your email address and register, you're also able to participate in the forums (though most of the comments are in Dutch). Activities are included for some sections, all reading links are provided in the text, and no single page is too long, meaning students can come in and out of the course as time allows. It's a great source to provide information for librarians new to RDM and/or DMPs, and is useful as background before additional in-person discussion or instruction at your local institution. 

Class: The Data Scientist's Toolbox
Source: Coursera
Instructor(s): various, Johns Hopkins Bloomberg School of Public Health
Format: scheduled online class, part of the Data Science Specialization series
When: many start dates, usually monthly 
Cost: free unless you want a certificate ($29)

This class is the first of nine (plus a capstone) in the Coursera/Johns Hopkins Data Science Specialization. It is a good introduction to the "data, questions and tools that data analysts and data scientists work with." The class is divided into two parts: the first is a basic introduction to what a data scientist does; the second introduces some of the tools of the trade, including markdown, git, R, GitHub, etc. If you're new to data librarianship and need to understand what your researchers are doing, this will give you a broad understanding of what data scientists do, and will help you understand a bit more about data sharing and open science.

A few additional educational resources: 

  • Data Information Literacy: Librarians, Data, and the Education of a New Generation of Researchers, by Carlson and Johnston. Published in late 2014, this book looks at what role librarians can play in helping a new generation of graduate students in STEM disciplines develop the competencies needed to manage research data. Material in the book comes from the work done by the authors and others for the IMLS-funded Data Information Literacy project.
  • The Mendeley group Data Management for Librarians, owned by Kevin Read, is a place to share literature and resources about data management, curation, citation, sharing, etc.
  • is a collaborative blog started in late 2014 aimed at sharing "resources, tips, conversations and strategies so that all of us can more effectively bring data resources to the people in our library communities." It's had a handful of posts from data librarians at different stages of their career, and is hoping to draw a larger audience via contributed posts. With the right participation, this could become a useful resource.

UW All-Campus Reproducibility Seminar: 2/10 @ 1:30pm

Ben Marwick, Assistant Professor of Archaeology, joins the All-Campus Reproducibility Seminar Series on February 10 at 1:30pm in the WRF Data Science Studio Meeting Room, 6th floor Physics/Astronomy Tower. He will give a talk titled: 

"Doing Reproducible Research with Docker"


A key obstacle to reproducible research that I frequently encounter when working with students and collaborators is keeping the toolkit simple, with managing dependencies being an especially time-consuming challenge. Virtual machines are one solution to these problems, but remain less than ideal because of relatively long start-up and shut-down times, their large size and performance demands, limited portability, and the need for the user to be familiar with a different desktop environment, amongst other concerns. In this talk I introduce Docker, a free and open source Linux container tool recently popular amongst commercial DevOps workers that provides lightweight virtual environments on Windows/OSX/Linux systems and has several advantages over regular virtual machines. I will describe the key elements of doing reproducible research with Docker and demonstrate dockerfiles, containers, images and registries (bring your laptop and follow along! If you're using Windows/OSX then be sure to install in advance). I will show how these help with dependencies and keeping things simple, especially when working with R or Python.
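For a taste of what the demo covers, here is a minimal sketch of a dockerfile for an R-based analysis. This is an illustrative example, not material from the talk; the base image and package names are assumptions chosen for the sketch.

```dockerfile
# Illustrative reproducible-research container for an R analysis.
# rocker/rstudio provides R and RStudio Server on a Debian base.
FROM rocker/rstudio

# Pin the R packages the analysis depends on inside the image,
# so collaborators don't manage dependencies by hand.
RUN R -e "install.packages(c('knitr', 'dplyr'), repos = 'http://cran.rstudio.com')"

# Copy the analysis scripts and data into the image.
COPY . /home/rstudio/analysis

# Collaborators rebuild and run the identical environment with:
#   docker build -t my-analysis .
#   docker run -p 8787:8787 my-analysis
```

The point of the talk, in miniature: the dockerfile is a small, shareable text file, so the whole computational environment travels with the project instead of living only on one person's machine.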

Please join us!

Monday, January 12, 2015

Special Journal Issue Focuses on Data Literacy and Librarians

Time for some reading: the latest issue of the Journal of eScience Librarianship focuses on the role of librarians in data literacy. Included are articles on data management education initiatives, designing RDM curriculum for librarians and graduate students, as well as some case studies from different institutions that used the New England Collaborative Data Management Curriculum in order to teach RDM to various constituencies.

Also featured is an "eScience in Action" piece titled Lessons Learned from a Research Data Management Pilot Course at an Academic Library, from the UW's own Mahria Lebow, Jennifer Muilenburg, and Joanne Rich, detailing their experience teaching a research data management course to graduate students in early 2014.

We're hoping to set aside some time to read through these articles in the next few weeks, and will hope to include some reaction here. Stay tuned!

Friday, January 9, 2015

DRUW: a glance under the hood

As promised, here is the blog post about the technologies we are going to be playing with to build our data repository. When we decided we wanted to pursue developing an institutional data repository, we evaluated different pieces of software, weighing variables like maturity of the system, the presence and type of community behind it, flexibility for handling different object types, and general future-proofness. There isn't much of a dramatic pause for me to insert here, as we've already written in previous posts that the outcome of this analysis was going with Hydra.

But what is Hydra? Hydra isn't a single thing or an out-of-the-box solution (though the community around it has set this as a future goal); rather, it's a framework of different pieces of software that come together to create an institutional repository. A Hydra installation can be used as a single interface to many different repositories, if we wanted to expand beyond the current scope of research data. Hydra is based on Fedora, the repository platform from DuraSpace, a nonprofit that supports a number of open source technologies related to digital assets (like DSpace and VIVO). Fedora is shorthand for Flexible Extensible Digital Object Repository Architecture, and as its long-form name implies, Fedora is a digital asset management system capable of handling content regardless of type (GIS, A/V, images, text, data, etc.). Of note, DuraSpace recently released Fedora 4, which has some significant changes from Fedora 3, including being happier about ingesting larger files and providing RDF representations of content and relationships by default. The Hydra community is energetically working at getting all of the pieces of the Hydra environment to play nicely with Fedora 4, and has advised new adopters of Hydra to plan on using Fedora 4 from the get-go, rather than create a situation that requires migration at a later date. So, we've had a bit of good luck here on our timing for jumping in!

So, while Fedora is in charge of managing the objects, the other core components of a Hydra build include Solr and Blacklight. Solr is an open source search platform from Apache that indexes the repository content. Blacklight is the discovery interface that plugs into Solr and provides features like (customizable) faceted browsing, exporting results and saving search history. Those are just the core technologies; there are many other packages of code (referred to as gems in the world of Ruby, the programming language behind Hydra) necessary to get an instance of Hydra up and running. The community has developed several different flavors of Hydra that leverage this framework of technologies in deployable web applications (technically, Rails engines); the one we've elected to go with is Sufia.
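To make the "framework of gems" idea concrete, here is a hedged sketch of what the Gemfile of a Sufia-based Hydra head might contain. The gem list is illustrative (versions omitted), not DRUW's actual dependency file.

```ruby
# Illustrative Gemfile sketch for a Sufia-based Hydra head (not DRUW's actual file).
source 'https://rubygems.org'

gem 'rails'   # Sufia is deployed as a Rails engine inside a Rails application
gem 'sufia'   # the Hydra "flavor" we chose; it pulls in the rest of the stack:
              # hydra-head, active-fedora (talks to Fedora), and Blacklight
              # (the Solr-backed discovery interface)
```

Because Sufia bundles the core pieces as dependencies, adding one gem brings in most of the framework; the remaining work is configuration, pointing the app at running Fedora and Solr instances.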

We’ve been working on use cases for our repository and our next steps are to define project phases, with realistic timelines and set milestones for each of these phases.

Tuesday, January 6, 2015

Data Librarianship Workshop for UW Libraries staff: Archives & Repositories

There are so many archives and repositories out there it can be difficult to know where to start looking to help someone in your field (or especially a field you’re not familiar with). This workshop, to be held Wednesday, January 28 from 2-3:30pm in the Allen Auditorium, will look at some of the categories of archives and repositories, and we’ll have time to share some of the similarities and differences across disciplines. We’ll also talk about some of the usage and ethics considerations that come into play when researchers share their data.

The workshop is open to all Libraries staff. Prior to the workshop, please identify 1-2 repositories in your subject area. Take 5-10 minutes and explore:
  • How easy it is to search for data
  • How easy it is to deposit data
  • What the depositor policies are
  • What kind of metadata the repository collects
  • Other general impressions

A good place to start (other than google) is

This workshop is the second of three workshops on data librarianship. The third will be held Wednesday, April 29th from 2-3:30pm in Allen Auditorium, and will focus on data management plans.

Questions can be left in the comments below.

Tuesday, December 9, 2014

DRUW Gets Going

As mentioned in our previous blog post, we are developing an institutional data repository here at UW. We are joining the Hydra community and building our digital repository using the Hydra framework, which pulls together various components and platforms, including Blacklight, Solr and Fedora. More about the technologies in a later post! This project is a partnership between the Libraries and UWIT; the data will live on UWIT's lolo filesystem.

How did we get here? A few years ago, the Data Services team conducted a survey of 323 campus researchers and found that a strong need on our campus was a place where researchers could store their data for the long term. This, coupled with funder mandates for providing public access to data, meant that providing a data repository service at UW just made sense. Luckily, the Libraries administration agreed with us!

Since getting the go-ahead on the project, the majority of our time (other than sorting technologies - to be discussed later) has been spent ensuring that we build a system that people are going to want to use and that meets their needs. The best way to do this, of course, is to figure out what those wants and needs might be. For this, we used a couple of approaches. First, we had two different standing library committees, the Data Services Committee and the Metadata Interest Group, create user stories. User stories are a technique from agile software development for defining system requirements from the perspective of the people who will use the system. There are different ways to write them; we chose to create each of ours from the skeleton sentence: "(a user type) wants to (their want) so that (why they want it)". An example user story generated from this exercise: "A data depositor wants to not have to contact a librarian to upload a dataset, so that depositing can be done when they want to." This particular story led to a desired system feature: self-deposit of datasets. Our most common user types were "Data depositor," "Researcher" and "Librarian."
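The skeleton sentence is really just a fill-in-the-blanks template. A tiny sketch (purely illustrative, not a tool we actually used) shows how the three blanks combine into a story:

```python
# Illustrative sketch: filling the user-story skeleton
# "(a user type) wants to (their want) so that (why they want it)".

def user_story(user_type, want, why):
    """Render one user story from the skeleton sentence."""
    return f"A {user_type} wants to {want} so that {why}."

# The example story from this post:
print(user_story(
    "data depositor",
    "not have to contact a librarian to upload a dataset",
    "depositing can be done when they want to",
))
```

Keeping every story in the same shape made it easy to compare stories from the two committees and, later, to distill the focus-group notes into the same format.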

While these were being developed focus groups were held, which brought together researchers from across campus to discuss what they would want out of a data repository. Specifically, the questions asked of the focus groups were intended to identify potential barriers to use, so that we can be aware of those from the beginning and do our best to minimize or eliminate them. These conversations were summarized and then further distilled into the user story format. In total, 71 unique system features were identified. We are currently working on prioritizing the different features, determining what features we have the capability to include now, and what we can perhaps work towards in a future development phase of the repository project.