GLAMpipe

metadata processing tool for cultural-historical data


This is a project by Wikimedia Finland, and no code is available yet. Things are still changing and different approaches are being tried, but this page tries to show the main ideas. Any feedback is welcome (ari.hayrinen@gmail.com).

The project is mainly funded by the Finnish Ministry of Education and Culture and partly by the WMF via the Wikimaps Warper 2.0 project.

GLAMpipe in action.

Purpose

The main purpose of GLAMpipe is to provide a tool with a graphical user interface for data manipulation and conversion, in both scripted and manual ways.

GLAMpipe uses nodes as its graphical user interface. With nodes it is possible to build an easy-to-follow structure for all the editing that a dataset needs. The screenshot above shows what is needed to download all images from a certain Flickr album.

Projects can be shared (not yet implemented), so that users can examine and adapt each other's workflows and code.

How does it work?

Basic workflow

Let's say you have an Excel sheet of data. You first export it to CSV format and then import it into GLAMpipe. There you can view your data again in sheet format.

Then you can start editing. Let's say you have an author field with multiple person names in it. You can add a Transform node called Split, which splits values. Note that you do not edit the original data; Split creates a new field instead. You can view the result and edit it manually if necessary.
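The Split step can be sketched in plain JavaScript as follows. This is only an illustration: the function name `splitField`, the field names, the sample records and the `;` separator are assumptions, not GLAMpipe's actual API.

```javascript
// Hypothetical sketch: split one field into a new field, leaving the original untouched.
function splitField(records, sourceField, targetField, separator) {
  return records.map((record) => {
    const raw = record[sourceField] || "";
    const parts = raw
      .split(separator)
      .map((part) => part.trim())
      .filter((part) => part.length > 0);
    // Copy the record and add the new field; the source field is not modified.
    return { ...record, [targetField]: parts };
  });
}

// Made-up sample data.
const records = [
  { title: "Harbour view", author: "Ann Author; Bob Writer" },
  { title: "Old church", author: "Ann Author" },
];

const splitRecords = splitField(records, "author", "author_split", ";");
console.log(splitRecords[0].author_split);
```

Because the result goes into a separate field, the original `author` value stays available for comparison or for a second attempt with a different separator.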

Next you want to list all your authors. You create a Group node that uses the field that you just created with Split. This creates a new collection (or table if you wish) that holds all unique author names from your data.
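A minimal sketch of the grouping idea, assuming records are plain objects and the split field holds an array of names. The function name `groupUnique` and the shape of the resulting documents are hypothetical.

```javascript
// Hypothetical sketch: build a collection of unique values from a field
// that may hold a single value or an array of values.
function groupUnique(records, field) {
  const seen = new Set();
  for (const record of records) {
    // [].concat() accepts both a single value and an array.
    for (const value of [].concat(record[field] || [])) {
      seen.add(value);
    }
  }
  // One document per unique value, in order of first appearance.
  return Array.from(seen, (name) => ({ name }));
}

// Made-up sample data.
const groupedInput = [
  { title: "Harbour view", author_split: ["Ann Author", "Bob Writer"] },
  { title: "Old church", author_split: ["Ann Author"] },
];

const authorCollection = groupUnique(groupedInput, "author_split");
console.log(authorCollection);
```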

Now, if your data is intended for Wikidata or Wikimedia Commons, it would be nice to have identifiers for persons. You can add a Wikidata lookup node, which searches Wikidata with the names in your author collection. Lookup results are saved to your author collection.
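One way such a lookup could work is through Wikidata's entity search (the `wbsearchentities` action of the MediaWiki API). The sketch below only builds the request URL and reads a response; the helper names, the `wikidata` field and the "take the first hit" rule are assumptions, and a hand-written sample response stands in for a live request.

```javascript
// Build a request URL for Wikidata's entity search (wbsearchentities).
function buildSearchUrl(name, language = "en") {
  const params = new URLSearchParams({
    action: "wbsearchentities",
    search: name,
    language,
    format: "json",
  });
  return `https://www.wikidata.org/w/api.php?${params.toString()}`;
}

// Pick the first hit, if any, from a wbsearchentities response body.
function firstMatch(responseBody) {
  const hits = (responseBody && responseBody.search) || [];
  return hits.length > 0 ? { id: hits[0].id, uri: hits[0].concepturi } : null;
}

// Hypothetical helper: store the lookup result on a document of the author collection.
function saveLookup(authorDoc, responseBody) {
  return { ...authorDoc, wikidata: firstMatch(responseBody) };
}

// Hand-written response in the shape the API returns, used instead of a live request.
const sampleResponse = {
  search: [
    { id: "Q42", label: "Douglas Adams", concepturi: "http://www.wikidata.org/entity/Q42" },
  ],
};

const lookedUp = saveLookup({ name: "Douglas Adams" }, sampleResponse);
console.log(buildSearchUrl("Douglas Adams"));
console.log(lookedUp);
```

Taking the first hit blindly is naive, since names are often ambiguous. This is where the ability to review and hand-edit the result matters.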

Finally, you can use a node (Filter? Transform?) to create a new field with the author name and Wikidata link. Then you export the data in whatever format you need. Done!
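The last two steps could look roughly like this. The field names (`wikidata_id`, `author_link`), the link format and the function names are assumptions made for the sketch; the CSV quoting follows the common convention of doubling embedded quotes.

```javascript
// Hypothetical sketch: combine name and Wikidata id into one new field.
function withAuthorLink(doc) {
  const link = doc.wikidata_id ? `https://www.wikidata.org/wiki/${doc.wikidata_id}` : "";
  return { ...doc, author_link: link ? `${doc.name} (${link})` : doc.name };
}

// Quote cells containing commas, quotes or line breaks; double embedded quotes.
function csvCell(value) {
  const text = String(value == null ? "" : value);
  return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

// Serialise the chosen fields of each document as one CSV line, after a header line.
function toCsv(docs, fields) {
  const lines = [fields.map(csvCell).join(",")];
  for (const doc of docs) {
    lines.push(fields.map((field) => csvCell(doc[field])).join(","));
  }
  return lines.join("\r\n");
}

// Made-up sample data.
const exportDocs = [
  { name: "Douglas Adams", wikidata_id: "Q42" },
  { name: "Unknown, Author" },
].map(withAuthorLink);

const csvText = toCsv(exportDocs, ["name", "author_link"]);
console.log(csvText);
```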

Documented workflow

The very point of GLAMpipe is that it can be used for manipulating data while, at the same time, it documents what you have done with your data. This documentation can then be shared, so that others do not have to reinvent the same workflow.

Documentation is saved to a GLAMpipe file. By sharing this file, you can show exactly how you edited your data. Others can then "fork" your project and use it as a starting point for their conversion of similar data.
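To make the idea concrete, a workflow can be thought of as plain data: an ordered list of steps that can be saved, shared and replayed. The sketch below is an assumption about how that might look; the step types, property names and the `replay` function are hypothetical and do not describe the actual file format.

```javascript
// Hypothetical step implementations, keyed by step type.
const stepTypes = {
  split: (records, step) =>
    records.map((r) => ({
      ...r,
      [step.out]: String(r[step.field] || "").split(step.separator).map((v) => v.trim()),
    })),
  uppercase: (records, step) =>
    records.map((r) => ({ ...r, [step.out]: String(r[step.field] || "").toUpperCase() })),
};

// Replaying the recorded steps in order reproduces the edit on any similar dataset.
function replay(workflow, records) {
  return workflow.steps.reduce((data, step) => stepTypes[step.type](data, step), records);
}

const workflow = {
  steps: [
    { type: "split", field: "author", out: "author_split", separator: ";" },
    { type: "uppercase", field: "title", out: "title_upper" },
  ],
};

// The workflow is plain data, so it survives a round trip through JSON (a saved file).
const sharedWorkflow = JSON.parse(JSON.stringify(workflow));

const replayed = replay(sharedWorkflow, [{ title: "Old church", author: "Ann Author; Bob Writer" }]);
console.log(replayed[0]);
```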

Nodes

A node has two main parts: a function, which defines what the node does, and a view, which defines how data is displayed and (possibly) manipulated.

Node scripts

A node can have several scripts that are executed when the node is run. Let's say that a node's function is to split the string in a certain field into pieces. The "run" script is then executed once per record, and the script must set out.value, which GLAMpipe then saves to the database.
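The contract described above can be sketched like this. Only `out.value` comes from the description; the `context` object, its `doc` and `settings` properties and the `runNode` function are assumptions, and an in-memory array stands in for the database.

```javascript
// Hypothetical "run" script for a split node: it receives one record and must set out.value.
function runScript(context, out) {
  const raw = context.doc[context.settings.field] || "";
  out.value = raw.split(context.settings.separator).map((part) => part.trim());
}

// Hypothetical runner standing in for the application: call the script once per record
// and store out.value in the node's output field.
function runNode(records, script, settings, outField) {
  return records.map((doc) => {
    const out = { value: null };
    script({ doc, settings }, out);
    return { ...doc, [outField]: out.value };
  });
}

const scripted = runNode(
  [{ author: "Ann Author; Bob Writer" }, { author: "Ann Author" }],
  runScript,
  { field: "author", separator: ";" },
  "author_split"
);
console.log(scripted);
```

Keeping the script limited to setting `out.value` means the script author does not need to know how or where the result is stored.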

Node views

A node also has views that define how data is displayed. A view is an HTML page with knockout.js; it can include JavaScript and can therefore interact with GLAMpipe. This means that you can build a view that allows interactive data editing.

Node parameters

Every node has parameters that are given when the node is created. These cannot be changed later.

Node settings

A node can also have runtime settings. For example, the Split node has a separator setting.
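The difference between parameters and settings can be illustrated with a small sketch. That the separator is a setting comes from the text above; treating the input and output field names as parameters is an assumption, as are the names `createSplitNode`, `inField` and `outField`.

```javascript
// Hypothetical node factory: params are fixed at creation, settings are supplied per run.
function createSplitNode(params) {
  // Freeze a copy so the parameters cannot be changed after creation.
  const frozenParams = Object.freeze({ ...params });
  return {
    params: frozenParams,
    run(records, settings) {
      const { inField, outField } = frozenParams;
      return records.map((doc) => ({
        ...doc,
        [outField]: String(doc[inField] || "").split(settings.separator),
      }));
    },
  };
}

const splitNode = createSplitNode({ inField: "author", outField: "author_split" });

// The same node can be run again with a different separator setting.
const bySemicolon = splitNode.run([{ author: "a;b" }], { separator: ";" });
const byPipe = splitNode.run([{ author: "a|b" }], { separator: "|" });
console.log(bySemicolon[0].author_split, byPipe[0].author_split);
```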

List of node types

  • Source
  • With a Source node you can import your data into GLAMpipe. Your data can be in a file (currently only CSV) or it can be imported from an API.

  • Process
  • A Transform node is used for modifying fields or files (like images) in your data. Typical modifications are trimming spaces, splitting values, changing case, creating image thumbnails and so on.

  • Group
  • A Group node allows you to create a new collection from the unique values of certain fields. For example, if you have an author field in your data and you would like to have a list of all authors, you can use a Group node.

  • Lookup
  • A Lookup node can be used for combining data from different sources. The source can be a collection in GLAMpipe or a web resource like Wikidata or VIAF.org.

  • Map
  • With a Map node you can rename and combine your fields.

  • Download
  • Download nodes can download images from services such as Flickr.

  • Export
  • An Export node exports data to a file (CSV, XML) or to an API.
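As an illustration of the Map node in the list above, renaming and combining fields could be expressed as a mapping from target field names to lists of source fields. The mapping format, the function name `mapFields` and the sample field names are assumptions made for this sketch.

```javascript
// Hypothetical mapping: each target field lists the source fields it is built from.
// One source field means a rename; several source fields are joined together.
function mapFields(doc, mapping, joiner = " ") {
  const result = {};
  for (const [target, sources] of Object.entries(mapping)) {
    result[target] = sources
      .map((source) => doc[source])
      .filter((value) => value !== undefined && value !== "")
      .join(joiner);
  }
  return result;
}

// Made-up sample record and mapping.
const mapped = mapFields(
  { nimi: "Harbour view", first: "Ann", last: "Author" },
  { title: ["nimi"], creator: ["first", "last"] }
);
console.log(mapped);
```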

Technology

GLAMpipe is a node.js application and it uses MongoDB for data storage (without MongoDB this would be a real pain in the **s). It can be run locally (see installation) or it can (possibly) be used as a web service.

But it really isn't a pipe, is it?

Well, GLAMpipe is not really a pipe or a dataflow tool (i.e. data flowing from one node to the next). Instead, each GLAMpipe node is run once, its result is saved to the database, and only after that can the next node be run. The reason for this is the importance of being able to hand-edit data at any phase. That would be a very difficult task in a dataflow-based application.
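The difference can be shown with a small sketch in which an in-memory array stands in for the database. The node functions and field names are hypothetical; the point is only that each run stores its result, so a manual correction made between two runs is what the next node sees.

```javascript
// In-memory stand-in for the database (made-up sample data with a trailing separator).
const store = [{ author: "Ann Author;Bob Writer;" }];

// Running a node writes its result into the store and then stops.
function runSplit(db, inField, outField, separator) {
  for (const doc of db) {
    doc[outField] = doc[inField].split(separator);
  }
}

// A second, independent node that reads whatever is currently stored.
function runCount(db, inField, outField) {
  for (const doc of db) {
    doc[outField] = doc[inField].length;
  }
}

// First node run: the trailing separator leaves an empty last value.
runSplit(store, "author", "author_split", ";");

// Hand edit between node runs: remove the empty value.
store[0].author_split.pop();

// Second node run: it works on the hand-edited data.
runCount(store, "author_split", "author_count");
console.log(store[0]);
```

In a pure dataflow design the split output would stream straight into the count step, leaving no stored intermediate result to correct by hand.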