CONP Portal | Share
Dataset addition procedures
It is possible to upload a dataset using one of the following:
Upload your dataset to https://zenodo.org with the specific keyword canadian-open-neuroscience-platform. If your dataset is larger than 50GB, you will need to contact Zenodo with a request category of "File upload quota increase" in order to be able to upload it.
If you set your dataset as restricted, create a personal token via Applications > Personal access tokens > New Token > Check all scopes > Create and send the token via email to CONP Technical Steering Committee member Tristan Glatard (email@example.com).
Upload your dataset to https://osf.io/ with the specific tag
Ensure the dataset/project is set to Public. Private datasets will be supported in the near future.
DataLad is a software tool for managing digital objects such as datasets, built on git and git-annex.
Downloading CONP data currently (April 2020) requires git-annex version 8.20200309 or later, and we recommend DataLad version 0.12.5 or later.
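A quick way to compare the installed versions against these minimums is sketched below. It assumes a `sort` that supports `-V` (version sort, as in GNU coreutils) and that `datalad --version` prints a line ending in the version number. `version_ge` and `report` are hypothetical helper names, not part of git-annex or DataLad.

```shell
#!/bin/sh
# version_ge A B: succeed if version A is greater than or equal to version B.
# Hypothetical helper; relies on `sort -V` (version sort).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# report NAME INSTALLED MINIMUM: print a one-line verdict.
report() {
    if version_ge "$2" "$3"; then
        echo "$1 $2: OK (minimum $3)"
    else
        echo "$1 $2: older than $3"
    fi
}

# Minimum versions quoted in the text above.
MIN_ANNEX="8.20200309"
MIN_DATALAD="0.12.5"

if command -v git-annex >/dev/null 2>&1; then
    # `git annex version --raw` prints only the version string;
    # any "-g<hash>" style suffix is stripped before comparing.
    annex_version=$(git annex version --raw)
    report "git-annex" "${annex_version%%-*}" "$MIN_ANNEX"
else
    echo "git-annex not found"
fi

if command -v datalad >/dev/null 2>&1; then
    # Assumption: the output has the form "datalad X.Y.Z".
    datalad_version=$(datalad --version 2>&1 | awk 'NR==1 {print $NF}')
    report "datalad" "$datalad_version" "$MIN_DATALAD"
else
    echo "datalad not found"
fi
```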
We recommend setting your git configuration to store your username and email:
git config --global user.name "yourusername"
git config --global user.email "firstname.lastname@example.org"
git config credential.helper cache
(The cache helper keeps login information in memory for 15 minutes by default.)
A.1 Installation on Linux
We recommend the Miniconda installation procedure detailed in the Install DataLad on linux-machines with no root access entry on the DataLad Handbook, which installs the most up-to-date versions of DataLad, git-annex, and git if needed.
A.2 Installation on Mac OS X
B. Creating a new dataset with DataLad
On GitHub, fork a new copy of https://github.com/CONP-PCNO/conp-dataset into your userspace.
Install that copy locally:
datalad install -r git@github.com:<yourusername>/conp-dataset
(The -r flag installs the directory structure recursively.)
- Create a local project:
cd conp-dataset
datalad create -d . projects/<newprojectname>
- Create a sibling for this project on GitHub. The command below will generate a sibling in your local space:
datalad create-sibling-github -d projects/<newprojectname> <newprojectname>
The first "newprojectname" is local, the second is the name of the github repository that will be created in your personal github space. These do not need to be identical for the procedure to work, but it is recommended that they should be to avoid confusion.
To inspect existing siblings:
datalad siblings -d projects/<newprojectname>
- Manually edit the .gitmodules file in your local conp-dataset directory:
The last three lines of this file will contain an entry for your new project, but the format datalad currently generates is not functional. The lines should be edited to the following format:
[submodule "projects/<newprojectname>"]
    path = projects/<newprojectname>
    url = https://github.com/<yourusername>/<newprojectname>.git
Previous entries in the
.gitmodules file can be used as a guide.
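As an alternative to editing by hand, the same entry can be written with `git config`, which can read and write files in the `.gitmodules` format. This is only a sketch and not part of the documented CONP procedure. `myuser` and `mydataset` are placeholder values, and the demo writes to a temporary directory so that it does not touch a real clone.

```shell
#!/bin/sh
# Placeholder values for illustration only.
GH_USER="myuser"
PROJECT="mydataset"

# Work on a scratch file so a real conp-dataset clone is not modified.
workdir=$(mktemp -d)
gitmodules="$workdir/.gitmodules"

# Write (or overwrite) the path and url keys of the submodule section.
# The key is section.subsection.name, so the subsection is "projects/$PROJECT".
git config -f "$gitmodules" "submodule.projects/$PROJECT.path" "projects/$PROJECT"
git config -f "$gitmodules" "submodule.projects/$PROJECT.url" \
    "https://github.com/$GH_USER/$PROJECT.git"

# Show the resulting entry.
cat "$gitmodules"
```

In a real clone, the `-f` argument would point at the `.gitmodules` file in the conp-dataset directory instead of the scratch file.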
Populating a new dataset
How you populate a new dataset depends on the special remote that provides access to the data. The following procedure covers working with the web special remote. Alternative, more experimental options using other special remotes are documented here: https://github.com/CONP-PCNO/conp-documentation/datalad_dataset_addition_experimental.md
All commands presented in the following sections should be run from
projects/<newprojectname> unless specified otherwise.
All datasets must include a
README.md in the root directory.
Adding metadata about your dataset is required.
All datasets must include a
DATS.json metadata file in the root directory as described in the main documentation page.
(It is not necessary to manually create these files when using Zenodo as the CONP Zenodo crawler automatically generates them.)
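Before publishing, it can help to confirm that both required files are in place. The sketch below only checks that README.md and DATS.json exist and that DATS.json parses as JSON; it does not validate the file against the DATS schema. `check_required_files` is a hypothetical helper name, and the JSON check assumes `python3` is available.

```shell
#!/bin/sh
# check_required_files DIR: hypothetical helper that reports whether
# README.md and DATS.json exist in DIR (default: current directory)
# and whether DATS.json is well-formed JSON.
# Returns 0 if everything is present and parseable, 1 otherwise.
check_required_files() {
    dir=${1:-.}
    status=0
    for f in README.md DATS.json; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing: $dir/$f"
            status=1
        fi
    done
    # Assumption: python3 is installed; json.tool fails on malformed JSON.
    if [ -f "$dir/DATS.json" ] &&
       ! python3 -m json.tool "$dir/DATS.json" >/dev/null 2>&1; then
        echo "not valid JSON: $dir/DATS.json"
        status=1
    fi
    return $status
}

# Example, run from projects/<newprojectname>:
#   check_required_files . && echo "required files present"
```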
- For large datafiles on ftp or http servers, use the web remote:
git annex addurl <URL_of_resource> --file <linkname>
The --file switch is optional but recommended, because without it the default name for a link is built from the full URL of the resource and tends to be unwieldy and/or uninformative.
NB: Generating the link requires enough space on your local machine to store the large data file, as git-annex needs to download the file to generate checksums.
- Add small files such as README.md directly to your git repository. These will not be annexed:
datalad add --to-git ./README.md
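When many URLs need to be registered with the web remote, the addurl step above can be scripted so that each link gets a short, readable name. The following is a sketch under stated assumptions: `link_name` and `add_urls` are hypothetical helpers, the link name is simply the last path component of the URL, and the input is a text file with one URL per line.

```shell
#!/bin/sh
# link_name URL: hypothetical helper that derives a short link name by
# dropping any query string or fragment and keeping the last path component.
link_name() {
    u=${1%%[?#]*}
    basename "$u"
}

# add_urls FILE: read one URL per line from FILE and register each one
# with git-annex, using the derived name for --file.
# Must be run from projects/<newprojectname>.
add_urls() {
    while IFS= read -r url; do
        # Skip blank lines.
        [ -n "$url" ] || continue
        git annex addurl "$url" --file "$(link_name "$url")" </dev/null
    done < "$1"
}

# Example:
#   add_urls urls.txt
```

Names derived this way can collide if two URLs end in the same file name, so a real script may need a different naming rule.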
C. Publishing a new dataset to GitHub
From your new project directory:
datalad save
datalad publish --to github
cd ../..    (this should put you in ~/conp-dataset/)
datalad save
datalad publish --to origin
Both of the save and publish steps are necessary, and must be done from the appropriate directories in the right order.
When adding new data to an existing project, the
publish --to github is replaced by another
publish --to origin command.
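Because the order of the save and publish steps matters, it may be convenient to wrap them in a small script. The sketch below prints the commands by default rather than running them. `run` and `publish_project` are hypothetical helpers, and `DRY_RUN` is an assumed environment variable, not a DataLad feature.

```shell
#!/bin/sh
# run CMD...: print the command by default; execute it when DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# publish_project SIBLING: save and publish the project dataset, then move
# up to the conp-dataset directory and save and publish it as well.
# Run from projects/<newprojectname>. SIBLING is "github" for a new
# project and "origin" when adding data to an existing project.
publish_project() {
    sibling=${1:-github}
    run datalad save &&
    run datalad publish --to "$sibling" &&
    run cd ../.. &&
    run datalad save &&
    run datalad publish --to origin
}

# Example (prints the five commands without running them):
#   publish_project github
# To execute them for real:
#   DRY_RUN=0 publish_project github
```

Chaining the steps with `&&` stops the sequence at the first failure, so the superdataset is not published if the project itself failed to publish.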
D. Testing the new dataset
- You should now have a git repository containing your new dataset correctly linked as a submodule of
<yourusername>/conp-dataset. Test this by downloading it:
datalad install -r http://github.com/<yourusername>/conp-dataset
cd conp-dataset/projects/<newprojectname>
The -r flag makes this a recursive install, so all subdirectories and small files should be present, along with links to the annexed files.
- Test that dataset files download correctly, either URLs (Web remote) or files (Globus remote):
datalad get [<url_name> | path/to/file]
E. Obtaining a Digital Object Identifier for your dataset
Datasets in CONP are required to have a Digital Object Identifier (DOI). A DOI is a unique and permanent identifier associated with a research object to make it citeable and retrievable. To get a DOI for your dataset, follow these steps:
Log in to Zenodo, preferably using your GitHub account.
Select your GitHub repository at Zenodo.
Release your dataset on GitHub (see instructions here), which creates a DOI and archives your dataset on Zenodo.
Get the DOI badge from here, add it to the README.md file of your dataset, and add its value to the identifier field of your DATS.json file. This links to the DOI associated with the latest release of your dataset.
Submit a pull request to merge your dataset with CONP-PCNO/conp-dataset. Travis-CI will automatically test your dataset to confirm that files download correctly, validate the format of your DATS.json file, and so on.
F. Longer-term use and storage
We recommend forking datasets into https://github.com/conpdatasets to reduce the risk that they become inaccessible once the projects that generated the data conclude. Whether this is needed depends on the circumstances of each data source.
If you need help at any stage, please open an issue in
the CONP-PCNO/conp-dataset repository and we will do our best to help you.