Tuesday, April 5, 2016

Data Science Studio Office Hours for Spring Quarter

As a reminder, the WRF Data Science Studio offers several types of drop-in office hours to meet the needs of those working in data-intensive science. The program brings together expertise from the eScience Data Scientists, UW Libraries, UW-IT, and the Center for Statistics and the Social Sciences (CSSS) to help triage challenges in data-intensive science – including cloud computing – and steer people toward appropriate solutions. Assistance may take the form of immediate help, a longer meeting with our team to understand the problem more deeply, or a referral to faculty on campus with relevant expertise.

Tuesday, March 29, 2016

Upcoming classes: Community Data Science Workshop, R + Stata

Several upcoming workshops and classes will be held Spring Quarter at the University of Washington, focusing on students needing R or Stata introductions, as well as another round of the popular Community Data Science Workshops. Details are below.


The Center for Social Science Computation and Research has posted their Spring Quarter classes, which include Introduction to Stata, Introduction to R with RStudio, and Introduction to R with R Commander. Students will learn basic software organization, where to find help, and how to get started with basic analyses. No previous experience in statistical programming is necessary, but a basic understanding of statistics will be helpful.


The Spring 2016 round of the Community Data Science Workshops is for anyone interested in learning how to use programming and data science tools to ask and answer questions about online communities like Wikipedia, free and open source software, Twitter, and civic media. The Spring 2016 series consists of one Friday evening and three Saturday sessions in April and May. The workshops are for people with no previous programming experience and, thanks to sponsorship from eScience and the Department of Communication, are free of charge and open to anyone.

Our goal is that, after the three workshops, participants will be able to use data to produce numbers, hypothesis tests, tables, and graphical visualizations to answer questions like:

- Are new contributors to an article in Wikipedia sticking around longer or contributing more than people who joined last year?

- Who are the most active or influential users of a particular Twitter hashtag?

- Are people who participated in a Wikipedia outreach event staying involved? How do they compare to people who joined the project outside of the event?

Details and dates are online here:

If you are interested in participating, please fill out our registration form at the link above before Saturday, April 2. Register soon!

If you already know how to program in Python, it would be really awesome if you would volunteer as a mentor! Being a mentor involves working with participants and talking them through the challenges they encounter in programming. No special preparation is required. If you’re interested, there’s a link on the page above, or you can send me an email. If you’ve mentored before, it still helps us if you fill out the form again. Thanks!

Mako (on behalf of Jonathan, Tommy, Dharma, Ben, Mika, and all the CDSW mentors)

Wednesday, March 23, 2016

Next Week! Digital Scholarship Focus Groups

In an effort to develop our digital scholarship program in the Libraries, we will be holding a series of focus groups with faculty and graduate students working in the sciences. The goals of the focus groups are to learn what types of digital scholarship research and teaching are currently being done in departments across campus, and what barriers (if any) exist to completing digital scholarship work. If you are working on digital projects or data visualization, we would love to hear from you! Faculty focus groups are March 29, 12:30-1:15pm, and March 30, 10:30-11:15am. Graduate student focus groups are March 29, 10:30-11:15am and 2:30-3:15pm. You may sign up for focus groups here. We'll confirm your participation and send you the location, along with a few questions we'll cover to help start the conversation. Light refreshments will be provided for participants.

Thank you for your participation! Questions can be directed to Verletta Kern, our Digital Scholarship Librarian.

Tuesday, March 15, 2016

STEM Journal Publishing: What’s an Editor to Do?

Join the UW Libraries for a panel discussion with four UW faculty members who are also journal editors. Geared toward graduate students, post-docs, and librarians, the discussion will address a variety of issues of interest to current and future authors. Possible questions for discussion include:

  • What do you do as an editor?
  • How did you become one?
  • Where do you fit in the hierarchy of your journal?
  • What does it take to get published in your field today?
  • What is the impact of the increase in manuscripts being submitted today?
  • How is peer review handled at your journal?
  • Have you run into ethical issues, and, if so, how did you deal with them?
  • What are some of the most common mistakes made by authors?
  • What advice would you give an author preparing to submit her/his first paper?
  • How is digital accessibility attained?
  • How do you manage traditional papers augmented with other content, such as video or audio?

Our panelists include:
Valerie Daggett: Professor, Bioengineering
Jody Deming: Professor, Oceanography and Professor, Astrobiology
Richard Ladner: Professor, Computer Science and Engineering
Randy Leveque: Professor, Applied Mathematics

Session moderator:
Kelly Edwards: Associate Dean for Student and Postdoctoral Affairs, Graduate
School, and Associate Professor, Department of Bioethics and Humanities,
School of Medicine

Tuesday, April 12, 4:00-5:00PM; Reception, 5:00-5:30PM
Research Commons, Presentation Place, Allen Library South

Thursday, February 11, 2016

Love Your Data Week, Day 4: Data Citation

As stated on the Love Your Data site: "Data are becoming valued scholarly products instead of a byproduct of the research process. Federal funding agencies and publishers are encouraging, and sometimes requiring, researchers to share data that have been created with public funds. The benefit to researchers is that sharing your data can increase the impact of your work, lead to new collaborations or projects, enable verification of your published results, provide credit to you as the creator, and provide great resources for education and training. Data sharing also benefits the greater scientific community, funders, and the public by encouraging scientific inquiry and debate, increasing transparency, reducing the cost of duplicating data, and enabling informed public policy."

Looking for some pointers on how to share your data? If you're at the University of Washington, you may already be sharing papers in the ResearchWorks Archive. The Libraries is in the process of updating the archive and working out how best to support data archiving on campus, so if you have data you want to preserve for the long term, contact us to see if we can use your data as a test case as we build a new data repository.

If you're interested in learning more about how data citation affects research reputation, Robin Chin-Roemer has a new book, Meaningful Metrics, that serves as a guide to impact, bibliometrics, and altmetrics, among other topics.

For today's activity, consider these "good practice" tips:

  • share your data upon publication
  • share your data in an open, accessible and machine readable format
  • deposit your data in your institution's repository to enable long-term preservation
  • license your data so people know what they can do with it
  • tell people how to cite your data 
  • when choosing a repository, ask about the support for tracking its use. Is a handle or DOI provided? Can the depositor see how many views and downloads the data has? Is the repository indexed by Google, Google Scholar, or the Data Citation Index?

Wednesday, February 10, 2016

Love Your Data Week, Day 3: Help Your Future Self

By Help Your Future Self, we mean Write It Down: document, document, document! Your documentation provides crucial context for your data. So whatever your preferred method of record keeping is, today is the day to make it a little bit better! Some general strategies that work for any format:

  • Be clear, concise, and consistent.
  • Write legibly.
  • Number pages.
  • Date everything, using a standard format (e.g., YYYYMMDD).
  • Try to organize information in a logical and consistent way.
  • Define your assumptions, parameters, codes, abbreviations, etc.
  • If documentation is scattered across more than one place or file (e.g., protocols & lab notebook), remind yourself of the file names and where those files are located.
  • Review your notes regularly and keep them current.
  • Keep all of your notes for at least 7 years after the project is completed.

Things to avoid:

  • Writing illegibly.
  • Using abbreviations or codes that aren’t defined.
  • Using abbreviations or codes inconsistently.
  • Forgetting to jot down what was unusual or what went wrong. This is usually the most important type of information when it comes to analysis and write up!

Today's activity: If your documentation could be better, try out some of these strategies:

Take a few minutes to think about how you document your data. What’s missing? Where are the gaps? Can you set up some processes to make this part of the work easier?

Using WinMerge to Manage Files and Folders

Post by Greta Pittenger, Data Services Specialist and MLIS student at UW iSchool

WinMerge is an open source tool for Windows that compares and (of course) merges files and folders. It uses side-by-side comparison windows and can create backups of files before you save what’s been merged. It also integrates with some version control applications and can create patch files and resolve conflict files.

I downloaded WinMerge at work and at home to give it a try. Confession: I have files from different projects stored in totally different places rather than backed up in multiple places - bad! To try out WinMerge’s compare tool, I managed to find two folders from the same project that in theory should be the same - one on a department server and the other in Dropbox.


Looks pretty good! There are a few Thumbs.db (thumbnail) files that are either not identical or only in one folder and not the other - not something I’m too worried about. It looks like the only other file that is different is Share_Publish.docx. Double-clicking on that line brings the two files into their own file-comparison tab.
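For readers who prefer scripting, roughly the same folder comparison can be sketched with Python's standard-library filecmp module. This is just a sketch of the idea, not part of the post's workflow; the demo folders and file names below are invented stand-ins for the server and Dropbox copies:

```python
# WinMerge-style folder comparison using only the Python standard library.
# report() summarizes which files match, differ, or exist on one side only.
import filecmp
import tempfile
from pathlib import Path

def report(left, right):
    """Return (same, different, left_only, right_only) file-name lists."""
    cmp = filecmp.dircmp(left, right)
    return (sorted(cmp.same_files), sorted(cmp.diff_files),
            sorted(cmp.left_only), sorted(cmp.right_only))

# Demo: a shared file, a changed file, and a stray Thumbs.db on one side.
left, right = Path(tempfile.mkdtemp()), Path(tempfile.mkdtemp())
(left / "README.txt").write_text("same on both sides")
(right / "README.txt").write_text("same on both sides")
(left / "Share_Publish.docx").write_text("old draft")
(right / "Share_Publish.docx").write_text("a newer draft")
(left / "Thumbs.db").write_bytes(b"")

print(report(left, right))
# -> (['README.txt'], ['Share_Publish.docx'], ['Thumbs.db'], [])
```

Like WinMerge's folder view, dircmp compares by size and contents, so it can flag files whose edit dates differ but whose contents are identical as matching.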


Wuh oh. Encoding issues… I’ll have to dig deeper into that. Suffice it to say, I checked the files in MS Word and they looked identical visually, but I noted in WinMerge that the most recent edit dates are different. Further investigation with the WinMerge manual is required, but I’ll save that for another day.

To get a better idea of what the file comparison looks like when it’s most helpful, I looked at two XML files. Often, I’ll save an XML file just before trying something new, and then save the changed file as a new version. This way, I can always go back to the old version if something in the new lines isn’t working the way I expect.

Screenshot (11).png

The differences (or, difference, in this case) show up in yellow upon first opening.

Screenshot (12).png

Clicking the Next Difference arrow takes me to the next, and only, difference for these files. The file on the right just has some added information within the brackets. I could merge these lines, or merge the entire files if more lines were different. I’ll leave them for now though.

Notice that the space taken up by the extra text on the right is represented on the left by a gray line. This keeps the rest of the documents lined up for easy side-by-side comparison. Here’s another quick example:

Screenshot (9).png

Again, the only difference is added text. The scroll bar in the Location Pane on the left side shows where all the differences in a document are, with the added gray space. The Diff Pane at the bottom shows the current difference in full.
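If you ever need this kind of line-by-line comparison without a GUI, Python's standard-library difflib produces a similar report. The XML snippets below are invented examples, not files from this post:

```python
# difflib reports which lines changed between an old and a new version
# of a file, much like WinMerge's file-comparison view.
import difflib

old = ["<settings>",
       "  <option>default</option>",
       "</settings>"]
new = ["<settings>",
       "  <option>default</option>",
       "  <option>added info</option>",
       "</settings>"]

for line in difflib.unified_diff(old, new, "v1.xml", "v2.xml", lineterm=""):
    print(line)
# Lines prefixed with "+" exist only in the new version, "-" only in the old.
```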

This is a great tool for backing up files (note to self) and making sure most recent backups are up to date. Even if you don’t merge files, it can be used to tell you if you need to save a new version after changes have been made. One of the most useful things, in my mind, is the automatic addition of WinMerge to drop-down menus when right-clicking on files, as displayed below:

Screenshot (10).png

So simple.

WinMerge can be downloaded for free. Make sure to take a look at the Quick Tour to get some more tips and training on ways it can be used.

Tuesday, February 9, 2016

Using Bulk Rename Utility in a Digital Preservation Workflow

Post by Liz Bedford, Data Services Project Librarian

My first experience getting into the nuts and bolts of digital preservation has been working with the Preservation department here at UW Libraries to remediate digitized versions of rare books for uploading into the HathiTrust Digital Library. Each page gets its own file, so for any book I’m working on, we’re talking about 100-500 TIFFs and JPEGs. It’s a finicky process, because Hathi has extremely high standards for the material they accept into the collection. Great for future readers! But let’s just say that if it weren’t for automation, there’s no way I’d still be able to type coherently by the end of the day.

One of the tools I’ve come to rely on is Bulk Rename Utility, an open-source file rename utility for Windows. It’s pretty much effortless to set up, has a straightforward and intuitive GUI, and offers a wide variety of ways to play with file names. I predominantly use the Numbering and Extension rename options, but that barely scratches the surface of the functionality.

Let’s look at a recent example. A book I was working on was composed of what I knew to be well-formed TIFFs. But I was unable to open the files, because at some point the file extension had been changed from .tif to .tif_original. After opening Bulk Rename Utility, I navigated to my folder using the left sidebar. The first column shows the current file name, while the second shows what the new name will be with the options you’ve selected below. After using Ctrl + Shift to highlight all of my files, I used the dropdown menu in the Extension option (number 11) to indicate that I wanted my extensions to be “Fixed” as tif.


After hitting ‘Rename,’ Bulk Rename Utility flashes an ‘are you sure’ warning, which, based on experience, I do appreciate:


Three seconds later, my TIFFs are TIFFs, and all is right with the world. (Aka Windows will now let me continue to do my job.)
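For the curious, roughly the same fix can be scripted in Python with pathlib. This is only a sketch of the idea, not part of the Hathi workflow: the folder path is a placeholder, and the dry_run flag stands in for Bulk Rename Utility's "are you sure" prompt:

```python
# Strip a stray ".tif_original" extension back to ".tif" for every file
# in a folder. Preview first (dry_run=True), then rename for real.
from pathlib import Path

def fix_extensions(folder, old_suffix=".tif_original", new_suffix=".tif",
                   dry_run=True):
    """Rename every file ending in old_suffix so it ends in new_suffix."""
    renamed = []
    for path in Path(folder).iterdir():
        if path.name.endswith(old_suffix):
            target = path.with_name(path.name[:-len(old_suffix)] + new_suffix)
            if not dry_run:
                path.rename(target)
            renamed.append((path.name, target.name))
    return renamed

# Hypothetical usage - the path is made up:
# fix_extensions(r"C:\scans\book01")                 # dry run, just lists
# fix_extensions(r"C:\scans\book01", dry_run=False)  # actually renames
```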

As I said, I’ve also used Bulk Rename Utility to re-number my files in a similarly easy process. I’m sure I’ll find other applications for the software - I’m particularly curious about the functionality that lets you insert new names from an imported text file - but for now, I’m thrilled it’s on my digital preservation toolbelt.

Love Your Data Week, Day 2: Data Organization

Today the focus of Love Your Data Week is data organization. We'll have two posts today, the first on the topic of naming, the second reviewing Bulk Rename Utility.

So, first: Data librarians at Penn State have written two blog posts on The Art of Naming Things, one of which focuses on the practice of creating logical element names in a dataset and functions in your code (among other things). The second post deals with naming schemes for files and directories.

Part of #LYD16 is a daily activity designed both to illustrate the concepts being discussed and to give data users a place to start. Today's activity is to come up with a folder structure and/or naming plan. Tips from the #LYD16 folks are:

If you don’t already have a folder structure and/or file naming plan, come up with one and share it. Some good practices for naming files are described below:

  • Be Clear, Concise, Consistent, and Correct
  • Make it meaningful (to you and anyone else who is working on the project) 
  • Provide context so the file name will still be unique and recognizable if the file is moved to another location.
  • For sequential numbering, use leading zeros.
    • For example, a sequence of 1-10 should be numbered 01-10; a sequence of 1-100 should be numbered 001-100.
  • Do not use special characters: & * % # ; ( ) ! @ $ ^ ~ ‘ { } [ ] ? < >
    • Some people like to use a dash ( – ) to separate words
    • Others like to separate words by capitalizing the first letter of each (e.g., DST_FileNamingScheme_20151216)
  • Dates should be formatted like this: YYYYMMDD (e.g., 20150209)
    • Put dates at the beginning or the end of your files, not in the middle, to make it easy to sort files by name
    • OK: DST_FileNamingScheme_20151216
    • OK: 20151216_DST_FileNamingScheme
    • AVOID: DST_20151216_FileNamingScheme
  • Use only one period, placed immediately before the file extension (e.g., name_paper.doc, NOT name.paper.doc or name_paper..doc)
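As a quick illustration of the leading-zero and date tips above, here is how zero-padded sequence numbers and a YYYYMMDD stamp might be generated in Python. The "DST_Scan" project tag is just an example name:

```python
# Build file names that sort correctly: zero-padded numbers plus a
# YYYYMMDD date stamp at the end.
from datetime import date

stamp = date(2015, 12, 16).strftime("%Y%m%d")   # -> "20151216"
names = [f"DST_Scan{n:03d}_{stamp}.tif" for n in range(1, 4)]
print(names)
# -> ['DST_Scan001_20151216.tif', 'DST_Scan002_20151216.tif',
#     'DST_Scan003_20151216.tif']
```

Because of the leading zeros and trailing date, an alphabetical sort of these names matches their numeric order.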

There are generally two approaches to organizing files. The first is filing: using a hierarchical folder structure. The other approach is piling, which relies on fewer folders and uses the search, sort, and tagging functions of your operating system or cloud storage tools like Box.