Data @ Libs: February 2017

Tuesday, February 21, 2017

UW Data Science Seminar: Kelsey Jordahl

Wednesday, February 22, 3:30 p.m. in Johnson Hall 102

Kelsey Jordahl, Mosaics Team Lead at Planet Labs, will be presenting “Mosaicking the Earth Every Day” at tomorrow's Data Science Seminar. The Data Science Seminar is free and open to the public.

Abstract

Planet Labs currently operates about 60 Earth observation satellites imaging 50 million square kilometers of land area per day. We plan on tripling those figures in coming months, fulfilling our Mission 1 to image the surface of the Earth every day. Global mosaics are created from these images at regular intervals (quarterly, monthly, and weekly) by selecting the best quality scenes (e.g. cloud- and haze-free), color balancing, and seamlessly compositing millions of scenes to create continuous maps of the Earth for each time slice. As our data rate increases, we plan on scaling up the cadence of our mosaics, including building a continuously updated "dynamic" mosaic of the most recent cloud-free images of the Earth. Daily data at 5 meter spatial resolution will open up new analysis techniques previously limited by the temporal or spatial resolution of existing instruments.
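For readers curious about what a "most recent cloud-free" mosaic might mean in practice, here is a toy sketch of the idea. It is not Planet's pipeline: the function name, the tiny grids, and the cloud masks are all made up for illustration, and real mosaicking also involves georeferencing, color balancing, and seam handling that are ignored here.

```python
def latest_clear_composite(scenes):
    """Build a toy mosaic from (day, pixels, cloudy) tuples.

    pixels and cloudy are 2-D lists of the same shape; cloudy holds
    True where a cell is obscured. Each output cell takes its value
    from the most recent scene in which that cell is not cloudy.
    Cells that are cloudy in every scene stay None.
    """
    if not scenes:
        return []

    # Visit the newest scenes first so they win wherever they are clear.
    ordered = sorted(scenes, key=lambda scene: scene[0], reverse=True)
    rows = len(ordered[0][1])
    cols = len(ordered[0][1][0])

    mosaic = [[None] * cols for _ in range(rows)]
    filled = [[False] * cols for _ in range(rows)]

    for _day, pixels, cloudy in ordered:
        for r in range(rows):
            for c in range(cols):
                if not filled[r][c] and not cloudy[r][c]:
                    mosaic[r][c] = pixels[r][c]
                    filled[r][c] = True
    return mosaic


# Hypothetical two-day example on a 2 x 2 grid.
example_scenes = [
    (1, [[10, 11], [12, 13]], [[False, False], [False, True]]),
    (2, [[20, 21], [22, 23]], [[True, False], [False, True]]),
]
print(latest_clear_composite(example_scenes))
```

Sorting newest-first and filling each cell only once keeps the logic simple; a cell that is cloudy on every day is left empty rather than filled with a cloudy value.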

Friday, February 17, 2017

Love Your Data Week, Day 5: Rescuing Unloved Data

How do data become unloved? We data users don’t love data that are messy, poorly documented, incomplete, or unwieldy, to name just a few frustrations. However, one important way that data become unloved is that they are just plain old. Older data tend not to be machine-readable, which can pretty much be the kiss of death. Digitization, while it’s improving, is still somewhat labor-intensive and costly, so unless a data set is obviously worth the trouble, it may languish.

However, researchers are starting to explore whether there may be some hidden gems worth rescuing. One area in which this is happening is climate data, and a great example is the Glacier Photograph Collection from the National Snow and Ice Data Center (NSIDC). Before this collection was digitized, users had to travel to the NSIDC in Colorado, ask staff to find physical images or microfilm for them in the collection, and then deal with those physical artefacts. Not surprisingly, the collection had few users. However, digitizing these photographs -- which can be considered data sources, as they contain information that can be analyzed -- has made them not only accessible, but an important resource for documenting changes in glacier size and coverage. Digitizing some of the old photographs also suggests locations for repeat photographs from the same vantage point, which can indicate changes across time periods.

PHOTO: Left: William O. Field, 1941; Right: Bruce F. Molnia, 2004. Muir Glacier: From the Glacier Photograph Collection. Boulder, Colorado USA: National Snow and Ice Data Center. Digital media.

But using the above example is cheating a little bit; these photographs were unloved because they were undigitized, but it was clear that they were worth digitizing. In fact, it was so clear that NSIDC was able to get funding and enter into partnerships to get that work done. So what if a researcher has a great idea, but needs sheer person-power to bring it to fruition? These days, crowd-sourcing may do the trick! Check out the Swiss project Data Rescue @ Home, in which citizen-volunteers are entering German climate data collected during WWII, and also have completed entering data from a weather station in the Solomon Islands collected in the early to mid-1900s. By January 2014, they reported having digitized 1.3 million values! They note: “The old data are expected to be very useful for different international research and reanalysis projects…[for example,] historical weather data from the Azores Islands are particularly valuable since the islands are located at the southern node of the most important climatic variability mode in the North Atlantic-European region, the so-called North Atlantic Oscillation (NAO), and there are not much other historical data available from the larger region.”

PHOTO: Example of data collected in the Solomon Islands, entered electronically by citizen-volunteers of the Data Rescue @ Home project (Accessed 2-13-17).

Interested in getting involved in a citizen-science project yourself? Here’s a list of possibilities! And if you really get hooked, you may want to dive into some collections of older non-digitized data and consider starting your own project, to rescue the unloved data and give them new life.

OK, I’m off now to figure out how to get on the project where I can hang out on the beach in New Jersey and count horseshoe crabs!

Ann Glusker PhD MPH MLIS

Research and Data Coordinator

National Network of Libraries of Medicine, Pacific NW Region
University of Washington Health Sciences Library

Thursday, February 16, 2017

Love Your Data Week, Day 4: Finding the Right Data

Welcome to Love Your Data Week, Day 4: Finding the Right Data. Today's theme is about asking the right questions, finding the right sources, and citing accordingly. Doing all three will help you locate the right data and help your audience see why you chose the data you did.

Our friends at the National Network of Libraries of Medicine/Pacific Northwest Region have taken today to highlight the new DataLumos initiative from ICPSR at the University of Michigan. This project aims to archive government datasets to ensure their preservation into the future. Check out their post on the Dragonfly blog describing this and other data archiving work happening around the country.

Wednesday, February 15, 2017

Love Your Data Week, Day 3: All’s FAIR in Love and Data Management

Welcome to day three of Love Your Data Week 2017! Today’s topic is Good Data Examples. What makes data “good” or “well managed”? The FAIR Data Principles (Findability, Accessibility, Interoperability, and Reusability) are a good place to start. Published by Mark Wilkinson and his colleagues in 2016, these principles “put specific emphasis on enhancing the ability of machines to automatically find and use the data, in addition to supporting its reuse by individuals.”¹ A brief description of the principles, excerpted from Wilkinson’s article, explains:

“To be Findable:

F1. (meta)data are assigned a globally unique and persistent identifier
F2. data are described with rich metadata (defined by R1 below)
F3. metadata clearly and explicitly include the identifier of the data it describes
F4. (meta)data are registered or indexed in a searchable resource

To be Accessible:

A1. (meta)data are retrievable by their identifier using a standardized communications protocol
A1.1 the protocol is open, free, and universally implementable
A1.2 the protocol allows for an authentication and authorization procedure, where necessary
A2. metadata are accessible, even when the data are no longer available

To be Interoperable:

I1. (meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation.
I2. (meta)data use vocabularies that follow FAIR principles
I3. (meta)data include qualified references to other (meta)data

To be Reusable:

R1. meta(data) are richly described with a plurality of accurate and relevant attributes
R1.1. (meta)data are released with a clear and accessible data usage license
R1.2. (meta)data are associated with detailed provenance
R1.3. (meta)data meet domain-relevant community standards”²
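To make a few of these principles concrete, here is a minimal sketch of a checklist applied to a single metadata record. It is only an illustration, not an official FAIR assessment tool: the field names (`identifier`, `description`, `access_url`, `license`, `provenance`) and the 50-character threshold for a "rich" description are assumptions made up for this example, and it covers only a handful of the principles listed above.

```python
# Each check maps a (loosely interpreted) FAIR principle to a test on a
# metadata record stored as a plain dict. Field names are hypothetical.
CHECKS = {
    "F1 persistent identifier": lambda r: bool(r.get("identifier")),
    # Arbitrary stand-in for "rich metadata": a description of 50+ characters.
    "F2 rich description": lambda r: len(r.get("description", "")) >= 50,
    "A1 standard protocol": lambda r: r.get("access_url", "").startswith(
        ("https://", "http://")
    ),
    "R1.1 usage license": lambda r: bool(r.get("license")),
    "R1.2 provenance": lambda r: bool(r.get("provenance")),
}


def fair_report(record):
    """Return a dict mapping each check name to True or False."""
    return {name: check(record) for name, check in CHECKS.items()}


def missing_items(record):
    """Return the names of the checks that the record does not pass."""
    return [name for name, ok in fair_report(record).items() if not ok]


# Made-up record for illustration; the DOI and URL are placeholders.
example_record = {
    "identifier": "doi:10.0000/example",
    "description": "Counts of bird species observed at survey points, 2015-2016.",
    "access_url": "https://example.org/datasets/1",
    "license": "CC-BY-4.0",
}
print(missing_items(example_record))
```

A checklist like this can only flag missing fields. Whether the metadata are actually accurate, or meet community standards (R1.3), still takes human judgment.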

These guiding principles benefit all stakeholders, including, as Wilkinson states, “researchers wanting to share, get credit, and reuse each other’s data and interpretations; professional data publishers offering their services; software and tool-builders providing data analysis and processing services such as reusable workflows; funding agencies (private and public) increasingly concerned with long-term data stewardship; and a data science community mining, integrating and analyzing new and existing data to advance discovery.”³

Wilkinson identifies several examples of FAIRness, including Dataverse, FAIRDOM, and Open PHACTS, and notes that the FAIR Guiding Principles have been adopted by a wide range of data management organizations across the globe.

¹⁻³ Wilkinson MD, Dumontier M, Aalbersberg IJ, Appleton G, Axton M, Baak A, Blomberg N, Boiten JW, da Silva Santos LB, Bourne PE, et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data. 2016 Mar 15;3:160018. doi: 10.1038/sdata.2016.18. PubMed PMID: 26978244; PubMed Central PMCID: PMC4792175.

Tuesday, February 14, 2017

Love Your Data Week: Day 2 - Documenting, Describing, Defining

Today’s topic is “Documenting, Describing, Defining”, and so we’re taking the opportunity to highlight a platform that can help streamline those processes.

We are happy to announce that UW has become an Affiliate of the Open Science Framework! UW staff, students, and researchers can now create OSF accounts using their NetID through the “Login through your institution” pointer on the Sign Up page. Not only is OSF a fantastic tool for data management and data sharing, it’s also a tremendous resource for keeping organized throughout the research process.

In a nutshell, OSF is like GitHub for workflows, except it can also serve as command central for all of the bits and pieces of your work that you’ve spread over Amazon S3, GitHub, Google Drive, Mendeley, and elsewhere. It is an open source, cloud-based project management platform, designed to help teams collaborate in one centralized location. Teams can connect third-party services that they already use for both storage and reference management directly to the OSF workspace. With version control, persistent URLs, and DOI registration, OSF is a powerful tool for enabling reproducible research practices.

Anyone can create an OSF account, so collaborating with people outside your institution is easy. You have fine-grained control over who has access to your project, or even to individual components of your project. OSF can therefore serve both as the sharing platform you use for externally focused materials like data sets and preprints and as the secure workspace you use to keep track of internal materials like analysis protocols and manuscript drafts. (A caveat: OSF is not HIPAA compliant, so you shouldn’t upload or link to your sensitive data.)

OSF is also a great tool for teaching reproducibility, allowing instructors to not only guide the shape of their students’ projects, but also to keep tabs on how successful students are in their workflow and data management efforts. If you’d like more information on how to use OSF in the classroom, this is an excellent presentation.

We are big fans of OSF here in Research Data Services, and we encourage you to check it out. This only scratches the surface of OSF’s capabilities, so if you’d like to learn more you can visit their extensive Help section, or contact us at libdata@uw.edu.

Monday, February 13, 2017

Love Your Data Week: Day 1

Welcome to Love Your Data Week 2017! Organized by Research Data Specialists from several academic institutions, Love Your Data Week aims "to raise awareness and build a community to engage on topics related to research data management, sharing, preservation, reuse, and library-based research data services." Each day we will highlight a different aspect of research data management. Follow our blog, check our Twitter page, and look for the hashtags #LYD17 and #loveyourdata for new and interesting data-related insights all week!

Today's Topic: Defining Data Quality

Today we are highlighting the work of a University of Washington research lab, to demonstrate how one group of researchers defines data quality.

Loma, Kaeli, and Jorge from the Avian Conservation Laboratory in the UW's School of Environmental and Forest Sciences kindly agreed to answer a few questions about data quality in their field of research. Let us know your experiences with data quality by tweeting with the hashtag #LYD17 to @UWLibsData.

Provide a brief introduction to yourself and your lab/team:

Kaeli: "I study the behavior of crows around dead crows (ethology/thanatology). Most other people in my lab also work on birds, but our individual studies, areas of research and methodologies vary greatly."

Jorge: "I'm an international student from Chile working on the Avian Conservation Lab of John Marzluff at the School of Environmental and Forest Sciences."

What does data look like in your area of research?

Kaeli: "My data is generally measurements of time (x seconds spent doing a particular thing or in a particular place) binary measurements (did or didn't something occur) and count data such as the number of birds present or the number of times an action occurred."

Jorge: "I have many different kinds of data. I have spatial data that includes locations and attributes of certain aspects of what individual animals I studied did on such places. I also have data on abundance of different bird species on the greater Seattle area."

The message for today is: "Data quality is the degree to which data meets the purposes and requirements of its use. Depending on the uses, good quality data may refer to complete, accurate, credible, consistent or “good enough” data." How would you define quality data in your field? Are there any standards for assuring data quality? How do you and your fellow researchers distinguish between quality data and questionable data?

Loma: "I've never thought of this before. I would assume that directly observable quantitative data would be considered better quality than qualitative data."

Kaeli: "This is actually a really hard question. It would probably be really difficult for me to just look at someone's data and determine if it was of poor quality. Perhaps if I was looking at their raw data sheets and noticed a lot of missing information, but otherwise the devil is in the methodological approach not necessarily the data itself. So I would question the data if say all but two data points were collected at a very specific time of day. Any standards for collecting quality data really come from both your field of study and what statistical methods you plan to use."

Jorge: "For me, quality data is representative and unbiased. The typical standards have to do with the quantity of data to be able to perform relevant statistical tests, and the training of the people that collected the data."

"For me, it's not intuitive to detect bad data. Sometimes you see patterns emerge that don't match what is expected, and that may help, but otherwise it is not that easy."

How did you decide what to measure and how to gather the data in your research?

Loma: "I created a hypothesis for the question I was trying to answer, then thought about what I could measure that would allow me to refute or fail to refute that hypothesis. For example, I'm currently trying to figure out what certain vocalizations mean to a crow, so I measured a number of behaviors that are indicative of agitation, fear, aggression, and curiosity. That way, I can compare how often a crow gives those behaviors both before and after I play a certain call through a loudspeaker."

Kaeli: "I mostly make it up as I go along. Which is kind of a joke and kind of not. Often I design an experiment based on what I think the most meaningful or robust measure of my question will be, but then once I get into the field I find out that doing it that way is actually impractical or impossible so I need to change it. So often in wildlife studies the answer to that question is that we try our best to guess what will work but ultimately we're at the mercy of our study animal and the elements."

Jorge: "I asked what would it be relevant to measure for the biological questions I was going to ask and what was it feasible to collect, given my logistical and budgetary constraints."

Do you have processes in place for maintaining your data for future use and sharing?

Loma: "I have my data backed up on two hard drives and the department cloud storage, and I'm willing to share it to anyone who asks so long as they convince me that I'd be included as an author/contributor on whatever they're working on."

Kaeli: "Yes but I don't really use them, [to be honest]. I back up all my data on my computer, 3 hard drives and in dropbox. We're supposed to also back them up on our lab's server but I hardly ever do this!"

Jorge: "I keep my data on several places (like the data sheets where I collected it, and different hard drives) to ensure its safety. I'm not planning on sharing my data at this time."

Thank you Loma, Kaeli, and Jorge for sharing your experience with data quality. We wish you the best of luck in your research!

Wednesday, February 8, 2017

Lecture: Robert Kosara on Data Visualization

How Do We Know That? Robert Kosara on Data Visualization
Tuesday, Feb. 21, 2017, 7 p.m.
Kane Hall: Room 110

The School of Art + Art History + Design and the Simpson Center for the Humanities present a free lecture by Robert Kosara, research scientist at Tableau Software, on data visualization.

Lecture Description from Events Calendar:
We know some things about data visualization, and we don’t know others. But it turns out that many of the things we think we know, we actually don’t. Much of what we believe about charts and visualization is based less on evidence and well-constructed science than we like to believe. Are bar charts always the best choice? Are embellishments bad? Are pie charts really evil? Is animation always distracting? When might it work?

Instead of well-run experiments and real evidence, many supposed rules are based on opinion, aesthetic judgments, and incomplete or oversimplified studies. You wouldn’t exactly know that from the level of conviction with which these things are often stated. In this talk, I will show you that some of the things we just assume to be true are actually wrong, many we don’t know about, and some that are, in fact, correct. But more than that, I want to draw your attention to the fact that there are many things we don’t really know – and show you how important it is to ask, how do we know that?

Speaker Information:
Robert Kosara is a research scientist at Tableau Software. His focus is on the communication of data through visualization and visual storytelling. Before joining Tableau in 2012, Robert was Associate Professor of Computer Science at the University of North Carolina at Charlotte. Robert received his MSc and PhD degrees in Computer Science from Vienna University of Technology (Austria). His blog, eagereyes.org, is one of the most popular and respected resources on data visualization.

For more information and to RSVP, please visit the Event Page.

Tuesday, February 7, 2017

New Book: The Practice of Reproducible Research

Interested in reproducible research practices? We have exciting news from Justin Kitzes of the Berkeley Institute for Data Science:

"I and several colleagues have just released the open, online version of our new book, "The Practice of Reproducible Research: Case Studies and Lessons from the Data-Intensive Sciences" (to be published in print by the University of California Press) -

http://www.practicereproducibleresearch.org

"The book is based around 31 case studies of research workflows, contributed by academic scientists and engineers from a variety of disciplines, in which each author describes the key practices, tools, and methods that they used to try to make their research as reproducible as possible. . . If reading 31 case studies sound like a bit much, we've also written a set of summary chapters (Part I of the book) that provides a basic overview of reproducible research and synthesizes lessons learned from across the contributed case studies."

Additionally, if you are looking to connect with other researchers interested in reproducible research practices, be sure to check out the eScience Institute's Reproducibility and Open Science Working Group.
