software versions and identifiers #157

danielskatz · 2016-09-29T07:07:51Z

I know the first version of the software citation principles has just been finalized, but I want to suggest that we consider adding a bit more to the discussion section about software versions.

I was just talking with @mfenner and learned that DOIs have fields that could be used to record relationships. This suggests that there is at least one way that different versions of software could have identifiers but metrics could be collected on the whole family of versions of a software package.

I think @mfenner suggested that the GitHub/Zenodo link be modified so that the first time a package is released through Zenodo, two DOIs would be created: one for the package and one for the version. The metadata for the version would indicate the DOI for the parent as well. Then for future versions, the same parent could be identified in the version's DOI's metadata.

@mfenner, please correct anything I got wrong.

I don't know how this would be done in other services such as figshare, but I'm sure it could be worked out.

ScottBGI · 2016-09-29T08:29:42Z

At GigaScience we've found the RelatedIdentifier field in DataCite metadata very useful and intuitive for describing software, input and output data, updates, and even documentation.

For example, our published SOAPdenovo2 genome assembly software (DOI:10.5524/100044) took an old input dataset (DOI:10.5524/100015) and created a new, improved output (DOI:10.5524/100038). We describe all of this in the DataCite metadata: we use IsPreviousVersionOf and IsNewVersionOf to describe which is the old and which is the improved data, and Compiles and IsCompiledBy to describe which is the input data and which is the software compiling it. We issued a correction to SOAPdenovo2 (DOI:10.5524/100148), so we used IsNewVersionOf to describe that. We link lots of other related tools and data to this with IsSupplementedBy/IsSupplementTo tags. The metadata has lots of other flexibility too, as you can link documentation with IsDocumentedBy and describe relationships between modules or forks with IsPartOf, IsVariantFormOf, etc. We lobbied DataCite to get “workflow” as a resourceTypeGeneral in their v3 metadata schema, so this schema can also describe the components of computational pipelines.

If you are looking for examples to use for this we have plenty we can contribute.
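As a rough illustration of how the relations described above might be written down, here is a minimal Python sketch that builds DataCite-style `relatedIdentifiers` entries. The DOIs are the ones quoted in this comment, but which record carries which relation is only one reading of the description, and the helper `related` and the record layout are hypothetical; the metadata actually registered by GigaScience may differ.

```python
import json


def related(relation_type, doi):
    """Build one DataCite-style relatedIdentifier entry pointing at a DOI."""
    return {
        "relatedIdentifier": doi,
        "relatedIdentifierType": "DOI",
        "relationType": relation_type,
    }


# Assumed reading: the improved output supersedes the old dataset and was
# produced by the SOAPdenovo2 software.
new_output = {
    "doi": "10.5524/100038",
    "relatedIdentifiers": [
        related("IsNewVersionOf", "10.5524/100015"),
        related("IsCompiledBy", "10.5524/100044"),
    ],
}

# Assumed reading: the correction is a new version of the SOAPdenovo2 record.
correction = {
    "doi": "10.5524/100148",
    "relatedIdentifiers": [related("IsNewVersionOf", "10.5524/100044")],
}

print(json.dumps(new_output, indent=2))
```

Each relation has an inverse (IsPreviousVersionOf, Compiles), so the same facts could equally be recorded on the other record of each pair.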

augustfly · 2016-09-29T13:30:00Z

how would the parent and the version 1 metadata differ (besides the relationtype)?

danielskatz · 2016-09-29T14:38:52Z

they wouldn't otherwise differ, I don't think

cboettig · 2016-09-29T19:27:41Z

I think the use of the related identifiers field to denote different software versions sounds very promising, but I'm not quite sure that the idea of issuing two identifiers (one for the package as a whole and one for the version) when, say, a Zenodo archival record is first created is ideal. What would such a record actually look like? Which of the two would be the identifier for the DataCite entry created? And what would the other point to? What DataCite relationship category would be used?

I think it makes more sense to assign a single identifier when the record is created, which is unique to the version actually archived at the time. When the record is updated (e.g. by a new GitHub release in the automatic Zenodo model), then a new identifier is issued, creating a new DataCite entry which now includes the field "relatedIdentifier" pointing to the previous version and using the relationship IsNewVersionOf.
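A minimal sketch of that single-identifier model, assuming a simplified record shape: each release gets its own record, and every record after the first carries one `relatedIdentifier` with the relationship IsNewVersionOf. The function name `new_version_record` and the DOIs are placeholders, not real records or any service's API.

```python
def new_version_record(doi, version, previous_doi=None):
    """Return a minimal DataCite-style record for one archived release."""
    record = {"doi": doi, "version": version, "relatedIdentifiers": []}
    if previous_doi is not None:
        # Point back at the release this one supersedes.
        record["relatedIdentifiers"].append({
            "relatedIdentifier": previous_doi,
            "relatedIdentifierType": "DOI",
            "relationType": "IsNewVersionOf",
        })
    return record


# Placeholder DOIs for two successive releases of the same package.
v1 = new_version_record("10.5072/example.1", "1.0.0")
v2 = new_version_record("10.5072/example.2", "1.1.0", previous_doi=v1["doi"])

print(v2["relatedIdentifiers"])
```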

Sure, it would be nice to have a way to cite / refer to the package as a whole vs the individual versions, but I think that is not particularly practical. Given good DataCite records with IsNewVersionOf, a citation crawler could aggregate citations across versions after the fact. No one wants to write two citations for the same thing, and having a package identifier and a version-specific package identifier sounds messy to me.

I recall we've also discussed this issue in setting up CodeMeta fields, so want to make sure we coordinate those recommendations with these; @mbjones might better remember our discussions and say if I'm off base here.

mfenner · 2016-09-29T20:28:51Z

@danielskatz described really well what we discussed earlier today. I think a different way to phrase this is to say that we want a persistent identifier for the specific release/version, but also a persistent identifier for the repo. In the GitHub world these two are clearly distinguishable, and, as Dan and Scott said, can be described in DataCite metadata.

There are a number of use cases for a persistent identifier for a software repo (as opposed to a specific release); an important one is aggregating all citations to specific versions for credit and attribution (principle #2). How else would you aggregate all citations for a piece of software, to the latest version, to all versions? The current implementation of issuing an identifier for principle #6 (specificity) can't also handle principle #2, unless there is only a single version of the software.

mfenner · 2016-09-29T20:32:40Z

You find the same idea also in the JISC recommendations for software citation: http://rrr.cs.st-andrews.ac.uk/wp-content/uploads/2015/10/guidelines-software-identification.pdf. They talk about a model of software entities and the "product" level vs. the "version" level, and they describe the usefulness of identifiers for the "product" level:

Using an identifier at this level may be appropriate to reference the general concept of a particular software artefact regardless of the specific version, or the continued use of this software over a long period. It is of use if different versions are going to be referenced as it can stand as a unifying record.

cboettig · 2016-09-29T21:22:59Z

@mfenner Thanks Martin! This clarifies a lot, and I'm all for giving a permanent identifier to the repo. For instance, I think this means that the version-specific identifier would still be the one that corresponds to the Zenodo record that gets created, and that record could simply refer to the source repository using the repo id instead of just the repo URL. It's not clear what the relationship property would be from the existing DataCite relation terms (IsVariantFormOf doesn't seem particularly precise), but maybe a new term could be created?

I assume the version-specific DOI would then resolve to the Zenodo record, and the package DOI would resolve to what? the GitHub repo? Or does it need to resolve to something more permanently archived? (If the latter, how would you archive something without archiving a particular version / snapshot?)

I'm also not sure that the notions of 'source code repository' and 'software product' are really 100% synonymous. I also didn't follow why we need two such identifiers to aggregate citations. Does this just assume people cite both identifiers, so that the citation count of the package ID can be used as the aggregate? (Do you think the sum of citations over versions will always equal that of citations to the package ID?) It seems to me that the only way to aggregate citations is to define what collection of identifiers is being aggregated (i.e. the most recent identifier and all other identifiers produced in walking the chain of IsNewVersionOf relations) and to count up the total. (Presumably other applications like transitive credit require this kind of walking the chain anyhow.)
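To make "walking the chain" concrete, here is a small Python sketch under stated assumptions: the records live in an in-memory dict keyed by DOI, each version has at most one IsNewVersionOf link, and the DOIs and citation counts are made up for illustration. The names `version_chain`, `aggregate_citations`, and `new_version_of` are hypothetical.

```python
def version_chain(latest_doi, records):
    """Follow IsNewVersionOf links from the latest DOI back to the first version."""
    chain, seen, current = [], set(), latest_doi
    while current is not None and current not in seen:  # 'seen' guards against cycles
        seen.add(current)
        chain.append(current)
        links = records.get(current, {}).get("relatedIdentifiers", [])
        current = next(
            (entry["relatedIdentifier"] for entry in links
             if entry["relationType"] == "IsNewVersionOf"),
            None,
        )
    return chain


def aggregate_citations(latest_doi, records, citation_counts):
    """Sum citation counts over every DOI reachable from the latest version."""
    return sum(citation_counts.get(doi, 0) for doi in version_chain(latest_doi, records))


def new_version_of(previous_doi):
    """Helper: a one-element relatedIdentifiers list pointing at the previous DOI."""
    return [{"relatedIdentifier": previous_doi,
             "relatedIdentifierType": "DOI",
             "relationType": "IsNewVersionOf"}]


# Made-up records and counts for three successive versions.
records = {
    "10.5072/example.1": {"relatedIdentifiers": []},
    "10.5072/example.2": {"relatedIdentifiers": new_version_of("10.5072/example.1")},
    "10.5072/example.3": {"relatedIdentifiers": new_version_of("10.5072/example.2")},
}
citation_counts = {"10.5072/example.1": 4, "10.5072/example.2": 2, "10.5072/example.3": 7}

total = aggregate_citations("10.5072/example.3", records, citation_counts)
print(total)  # 4 + 2 + 7 = 13
```

Note that a backwards walk from one starting DOI only covers a linear history; if versions branch, the aggregator would need a starting point on every branch or forward links as well.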

Don't mean to be contrary here; I'm all for having identifiers for things and defining the relationships to them. It just seems to me that it's not essential to have an identifier for the 'product' level to accomplish the goals here, and am unclear about the practical side of how it would be implemented and used (as with the above question about where would that DOI even resolve to?).

npch · 2016-09-30T11:09:28Z

tl;dr: I think there's a difference between the concepts of linking specific versions of software, giving an identifier for a software repo, and giving an identifier which provides a way of referencing the abstract entity that is a piece of software.

We looked at this a bit with JORS and SSI related work.

It seemed to come down to there being two different reasons for wanting to cite software:

- to reference a specific version to record it for publication (provenance)
- to enable others to find the software, often the latest version (advertising)

Which map nicely to the use cases listed in 10.7717/peerj.2394/table-2 which either do or don't require software versions.

We questioned whether it was easier to do the second of these aggregate use cases by:

1) Just using DOIs for specific versions of the software, with no linkage.
2) Using the DOI relationship fields to create "trees" of software DOIs, but essentially having a single type of identifier, which always points to a single version of the software.
3) Using a DOI for the abstract concept of the software itself (similar to the way earth observation data that is continually updated is identified; see the CEOS Persistent Identifiers Best Practices, although they do not use DOIs for specific versions in current implementations, but rather a different, looser identifier).
4) Using a DOI to identify the software repository (which is hard to do sensibly unless the repository infrastructure provider assigns the DOIs).

In the end we went with 1) because it looked like the other options would have taken too much effort at the time to implement, and required the community as a whole to adopt one way or the other.

danielskatz · 2016-10-01T08:55:20Z

I think the case/reason that you (@npch) are leaving out is citation and credit, where the authors both want to get credit for a specific version, but also want to be able to roll up that credit into a credit for all versions of the software.

But I agree that at the time, using your 1) was a good choice. Now that we have a chance to influence the larger community when software citation standards move into an implementation phase, I think it might be time to consider other options.

mfenner · 2016-10-01T09:41:14Z

This is a very interesting discussion, and I agree with @danielskatz that the timing is right, as we are moving from principles to implementation.

The idea of an identifier pointing to the latest version of something is very popular, and is obviously how we navigate the web (only that for the average webpage it is very hard to go to a previous version). What I am advocating for is that this is not the best way to use identifiers for scholarly resources, as it doesn't properly address specificity and attribution. The main problem is that the thing the identifier is pointing to is changing with every version.

The IMHO much better implementation is to have an identifier that points to an abstract, versionless concept, rather than to the latest version. This versionless concept then links to specific versions. This helps with a number of use cases described in the software citation paper. This is also how software package repositories often work, see for example (I mainly use Javascript and Ruby) https://www.npmjs.com/package/bower or https://rubygems.org/gems/factory_girl.

The implementation using GitHub, Zenodo and DOIs is not quite following this pattern, and I guess that @danielskatz and I are suggesting that it should. The needed changes are probably the following:

- Whenever a GitHub revision is archived in Zenodo and a DOI is minted, Zenodo should check whether a DOI for the repository as a whole exists. If not, that DOI is minted in addition to the revision-specific DOI. This repository DOI is for a "collection" and would point to a Zenodo record that lists all versions and their respective DOIs. This collection should also link to the GitHub repository, and ideally also to the package repository (Rubygems, NPM, PyPI, CRAN, etc.) using the relatedIdentifier metadata field. There is no need for GitHub to mint this DOI; this can happen via Zenodo (to the point raised by @npch).
- In the GitHub README we want at minimum to link to the collection DOI and to the code package repository (NPM or Rubygems, etc.), if the software is also indexed there. Ideally the README also lists all versions and their DOIs, and links to the GitHub releases for them. This would follow the pattern the user sees when going to the code package repository and Zenodo (if the above changes are implemented).
- DataCite needs to implement an appropriate relationType for this scenario (to the point raised by @cboettig). I think DataONE is using "HasVersions/IsVersionOf", which seems a good fit for describing this parent/child relation, different from the sibling relations between different versions, described by "IsNewVersionOf/IsPreviousVersionOf". Luckily the DataCite Metadata Working Group is currently working on how to support software citation for the next release of the metadata schema.
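The first change above could look roughly like the following Python sketch. It uses an in-memory stand-in for a DOI registry, not Zenodo's or DataCite's actual API; the class `Registry`, its methods, the placeholder DOI prefix, the repository URL, and the relation names in the two constants are all assumptions for illustration. The relation type used to link the collection to the GitHub repository is left open here, so the URL is simply stored as a plain field.

```python
# Placeholder names for the parent/child relation discussed above.
PARENT_TO_CHILD = "HasVersion"
CHILD_TO_PARENT = "IsVersionOf"


class Registry:
    """In-memory stand-in for a DOI registry (hypothetical, for illustration)."""

    def __init__(self, prefix="10.5072"):
        self.prefix = prefix
        self.records = {}              # DOI -> metadata
        self.collection_for_repo = {}  # repository URL -> collection DOI

    def _mint(self, **metadata):
        doi = f"{self.prefix}/example.{len(self.records) + 1}"
        self.records[doi] = {"doi": doi, "relatedIdentifiers": [], **metadata}
        return doi

    def _relate(self, source, relation_type, target):
        self.records[source]["relatedIdentifiers"].append({
            "relatedIdentifier": target,
            "relatedIdentifierType": "DOI",
            "relationType": relation_type,
        })

    def archive_release(self, repo_url, tag):
        """Mint a version DOI, and a collection DOI only if none exists yet."""
        collection = self.collection_for_repo.get(repo_url)
        if collection is None:
            collection = self._mint(kind="collection", repository=repo_url)
            self.collection_for_repo[repo_url] = collection
        version = self._mint(kind="version", version=tag)
        # Record the parent/child relation in both directions.
        self._relate(version, CHILD_TO_PARENT, collection)
        self._relate(collection, PARENT_TO_CHILD, version)
        return collection, version


registry = Registry()
repo = "https://github.com/example/project"  # hypothetical repository
collection_a, v1 = registry.archive_release(repo, "v1.0.0")
collection_b, v2 = registry.archive_release(repo, "v1.1.0")

# The collection record now lists every version DOI minted for the repository.
listed = [r["relatedIdentifier"] for r in registry.records[collection_a]["relatedIdentifiers"]]
print(collection_a, listed)
```

The second call reuses the collection DOI minted by the first, so the collection record accumulates one child link per release.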

I understand that code repository doesn't equal software, but it is a good proxy for a lot of open source software. And for the other cases I think we still need these two identifiers, just pointing to something else.

We should also not forget that collecting software citations is really hard, and we need all the help we can get. Having a parent identifier for all versions that links to all the citations found is extremely powerful, as we don't want everyone to aggregate the citations to different versions themselves, in the worst case with different results.

lnielsen · 2016-10-11T00:20:48Z

The approach we are planning on pursuing for Zenodo is the one described by @mfenner. One "container"-DOI, plus a "version"-DOI per release. Some users prefer having the container-DOI cited, whereas others prefer having the version-DOI cited.

"HasVersions/IsVersionOf", which seems a good fit for describing this parent/child relation

Wouldn't it be possible to simply use isPartOf/hasPart instead?

One complexity that we have to deal with is that IsPreviousVersionOf/IsNewVersionOf does not model semantic versioning very well, especially in cases where releases happen out of order (e.g. 1.1, 1.2, 1.1.1, ...).
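A small Python sketch of the out-of-order problem, using the release sequence from the example above. It compares two ways of picking the "previous version": the release published immediately before, and the highest earlier release that sorts below it numerically. The helper names are hypothetical, and `parse` is a deliberate simplification that only handles purely numeric dotted versions (no pre-release or build tags).

```python
def parse(version):
    """Turn '1.1.1' into (1, 1, 1) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def semantic_predecessor(version, released_before):
    """Highest already-released version that sorts below the given one."""
    lower = [v for v in released_before if parse(v) < parse(version)]
    return max(lower, key=parse) if lower else None


release_order = ["1.1", "1.2", "1.1.1"]  # order of publication

# Chronological chain: each release is a "new version of" the one before it.
chronological = {new: old for old, new in zip(release_order, release_order[1:])}

# Semantic chain: each release follows its nearest lower version number.
semantic = {
    version: semantic_predecessor(version, release_order[:i])
    for i, version in enumerate(release_order)
    if i > 0
}

print(chronological)  # {'1.2': '1.1', '1.1.1': '1.2'}
print(semantic)       # {'1.2': '1.1', '1.1.1': '1.1'}
```

The two mappings disagree for 1.1.1: by publication date it follows 1.2, but as a patch release it follows 1.1, which gives 1.1 two successors and turns the chain into a tree.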

mfenner · 2016-10-11T09:49:34Z

@lnielsen I am happy to hear that Zenodo plans to issue "container"-DOIs in addition to "version"-DOIs. IsPartOf/hasPart is a good starting point, but I think in the long run it would be better to have a different relation type, as IsPartOf/hasPart is probably not the best fit for versions, but rather would probably be used if a software package has several dependencies/parts.

cboettig mentioned this issue Sep 29, 2016

Software identifier vs version identifer? codemeta/codemeta#113

Closed

alee mentioned this issue Dec 8, 2016

software citation comses/comses.net#13

Closed

alee mentioned this issue Jan 4, 2017

software citation comses/comses.net#20

Closed
