CONP Portal | Share
Tools/pipelines can be uploaded using Boutiques’ command
For more information on how to do that, please visit the ‘Publishing your own tool’ section of the Boutiques tutorial Python notebook.
Thank you for sharing your data with the CONP and the scientific community! Making your data Findable, Accessible, Interoperable, and Reusable is key to making your research findings impactful, scientifically and strategically useful to both you and your peers, and more likely to contribute to scientific discovery.
Below, we describe four different methods for adding your data to the CONP Portal. All methods have the following requirements in common:
- A README.md file: The content of this file will be displayed on the Portal page describing your dataset. It is in Markdown format, for which there are many guides; here is one quick cheatsheet.
- A DATS.json file: as described in the main documentation page. We provide a DATS GUI editor for easy creation of this file. Note: the content of the DATS.json file will be used to populate various fields describing your dataset on its Portal page.
- A study/institutional logo can be added to the root directory of the dataset. If this is done, it will be used on your dataset’s Portal page along with the information in the README.md and DATS.json files that describe your dataset.
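Before uploading, it can help to confirm that the required files are in place. The function below is a minimal sketch of such a check. The name check_conp_requirements is hypothetical, and it only tests that README.md and DATS.json exist and are non-empty in the dataset root, not that their content is valid.

```shell
# Hypothetical helper: verify that the files required by every upload
# method are present (and non-empty) in the root of a dataset directory.
check_conp_requirements() {
    local dataset_dir="${1:-.}" missing=0 required
    for required in README.md DATS.json; do
        if [ ! -s "$dataset_dir/$required" ]; then
            echo "missing or empty: $dataset_dir/$required" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage: check_conp_requirements /path/to/dataset && echo "ready to upload"
```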
It is possible to upload a dataset using one of the following options:
If you need help at any stage, please open an issue in the CONP-PCNO/conp-dataset repository and we will do our best to help you.
Upload your dataset to Zenodo with the specific keyword canadian-open-neuroscience-platform. If your dataset is larger than 50 GB, you will need to contact Zenodo support using the request category "File upload quota increase" before you can upload it.
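To see whether a dataset crosses the 50 GB threshold before uploading, its size on disk can be measured first. This is a rough sketch: the function names are hypothetical, du -sk reports disk usage in 1024-byte units, and reading "50GB" as 50 × 10^9 bytes is an assumption about how the quota is counted.

```shell
# Hypothetical helper: true when a size given in KiB (as printed by `du -sk`)
# is larger than a limit given in decimal gigabytes (assumed: 1 GB = 10^9 bytes).
kib_exceeds_gb() {
    local size_kib="$1" limit_gb="$2"
    [ "$size_kib" -gt $(( limit_gb * 1000000000 / 1024 )) ]
}

# Hypothetical wrapper: measure a directory and compare it with the limit.
dataset_exceeds_gb() {
    local dataset_dir="$1" limit_gb="$2" size_kib
    size_kib=$(du -sk "$dataset_dir" | cut -f1)
    kib_exceeds_gb "$size_kib" "$limit_gb"
}

# Usage: dataset_exceeds_gb /path/to/dataset 50 && echo "request a quota increase first"
```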
If you set your dataset as restricted, create a Zenodo Personal Access Token (via Applications > Personal access tokens > New Token). Check all scopes when creating the token and send the token via email to CONP Technical Steering Committee member Tristan Glatard (firstname.lastname@example.org).
Upload your dataset to the OSF with the specific tag
CONP supports both Public and Private datasets.
For Public datasets, ensure that the dataset is set to Public on the OSF.
For Private datasets, do the following to ensure that the CONP automatic crawler can retrieve the OSF dataset and add it to the CONP super dataset:
- Ensure that the dataset is set to Private on the OSF.
- In the Contributors tab for the dataset, create user CONP-BOT and grant it
- Do not add the CONP-BOT user as a Bibliographic Contributor.
This upload procedure requires some technical knowledge (GitHub, git, git-annex) and an account on GitHub, but offers some useful options and flexibility in data-hosting location.
The CONP datasets are managed using DataLad, a tool built on git and git-annex for managing digital objects such as datasets.
The CONP datasets currently require git-annex version >= 8.20200309, and we recommend DataLad version >= 0.12.5.
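These minimum versions can be checked from the command line. The snippet below is a sketch rather than an official check: version_ge is a hypothetical helper, it relies on a sort that supports version ordering (-V), and the commented usage assumes the output formats described in the comments.

```shell
# True when version $1 is greater than or equal to version $2.
# Assumes a `sort` that supports -V (version sort), as GNU coreutils does.
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Usage (assumes `git annex version --raw` prints a bare version string such
# as 8.20200309, possibly followed by a build suffix after a dash, and that
# `datalad --version` prints "datalad <version>"):
#   installed_annex=$(git annex version --raw | cut -d- -f1)
#   version_ge "$installed_annex" 8.20200309 || echo "git-annex is too old" >&2
#   installed_datalad=$(datalad --version | awk '{print $2}')
#   version_ge "$installed_datalad" 0.12.5 || echo "DataLad is older than recommended" >&2
```

Version sort is used instead of a plain string comparison so that, for example, a future 10.x release is still ordered after 8.x.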
We also recommend setting your git configuration to store your username and email:
git config --global user.name "your_user_name"
git config --global user.email "email@example.com"
git config credential.helper cache
(The last command keeps login information in memory, for 15 minutes by default.)
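If a specific cache lifetime is wanted, git's cache credential helper accepts a --timeout option given in seconds. The value below (300 seconds, i.e. five minutes) is only an illustration and not a CONP recommendation.

```shell
# Run inside the dataset repository: without --global, the setting is
# stored for that repository only. 300 seconds is an illustrative value.
git config credential.helper 'cache --timeout=300'
```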
Summary of the DataLad upload process:
- 1) Installing DataLad
- 2) Creating a new DataLad dataset
- 3) Populating the new dataset
- 4) Publishing the new dataset to GitHub
- 5) Testing the new dataset before adding it to the conp-dataset DataLad super dataset
- 6) Obtaining a Digital Object Identifier (DOI) for your dataset
- 7) Adding the new dataset to the list of CONP datasets (a.k.a. https://github.com/CONP-PCNO/conp-dataset)
1) Installing DataLad
On Linux: We recommend the Miniconda installation procedure detailed in the "Install DataLad on linux-machines with no root access" entry of the DataLad Handbook, which installs the most up-to-date versions of DataLad, git-annex, and git.
2) Creating a new DataLad dataset
The first step in uploading a dataset to CONP via DataLad is to create a new DataLad dataset that will be tracked on GitHub.
Create a local DataLad directory:
datalad create <new_dataset_name>
Create a sibling for your dataset on GitHub. The command below will generate a sibling in your local space:
cd <new_dataset_name>
datalad create-sibling-github <new_dataset_name>
To inspect existing siblings, run
3) Populating the new dataset
How to track the different files of the dataset:
All files in the dataset must be added to the repository using one of the two commands below. Copying content from another location into your local copy of the repository without using those commands will not work.
For data files on FTP or HTTP servers, use the web remote to populate the data:
git annex addurl <URL_of_resource> --file <linkname>
The --file switch is optional but recommended, because without it the default name for a link is built from the full URL of the resource and tends to be unwieldy and/or uninformative.
NB: Generating the link requires enough space on your local machine to store the large data file, as git-annex needs to download the file to generate checksums.
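When many files are hosted remotely, the addurl call can be repeated for each one. The following is a sketch under stated assumptions: urls.txt is a hypothetical text file with one URL per line, both function names are hypothetical, and the link name is simply the last path component of the URL with any query string or fragment removed.

```shell
# Hypothetical helper: derive a short link name from a URL.
link_name_from_url() {
    local url="$1"
    url=${url%%\?*}   # drop a query string, if any
    url=${url%%#*}    # drop a fragment, if any
    url=${url%/}      # drop one trailing slash
    printf '%s\n' "${url##*/}"
}

# Hypothetical helper: register every URL listed in a file (one per line).
add_urls_from_list() {
    local list_file="$1" url
    while IFS= read -r url; do
        [ -n "$url" ] || continue   # skip blank lines
        # stdin is redirected so the command cannot consume the URL list
        git annex addurl "$url" --file "$(link_name_from_url "$url")" < /dev/null
    done < "$list_file"
}

# Usage, from inside the dataset directory: add_urls_from_list urls.txt
```

Deriving the name from the last path component keeps links short, but two URLs ending in the same file name would collide, so a real dataset may need a naming scheme of its own.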
For metadata files (e.g. DATS.json, logo files), use git so that they are not annexed and are therefore readable by the CONP portal and users:
git add README.md
git add DATS.json
git add <study_logo>
To save the changes made to the directory with DataLad, run:
datalad save -m "<a constructive message describing the state of the dataset>"
For example: initial population of <name_of_new_dataset>
4) Publishing the new dataset to GitHub
Ensure that all changes have been saved in DataLad (datalad save -m "<message>"). Then, from the DataLad directory, publish the DataLad dataset to GitHub:
datalad publish --to github
5) Testing the new dataset before adding it to the CONP-PCNO/conp-dataset DataLad super dataset
Test that the dataset published on the new GitHub repository can be correctly downloaded:
Do a clean install of the new dataset GitHub repository in a different directory:
datalad install -r http://github.com/<your_user_name>/<new_dataset_name>
cd <new_dataset_name>
The -r option requests a recursive install, which installs all submodules of the new dataset (if there are any).
Test that all dataset files download correctly from the URLs:
datalad get *
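After datalad get finishes, one way to spot files whose content did not arrive is to look for dangling symbolic links. This sketch assumes the dataset is in git-annex's default (non-adjusted) mode, where each annexed file is a symlink into .git/annex that resolves only once its content is present. The function name is hypothetical.

```shell
# Hypothetical helper: print annexed files whose content is still absent.
# With -L, find follows symlinks, so only links that cannot be resolved
# are still reported as type l. Anything named .git is skipped.
list_missing_annex_content() {
    find -L "${1:-.}" -name .git -prune -o -type l -print
}

# Usage, from the freshly installed dataset:
#   datalad get *
#   [ -z "$(list_missing_annex_content .)" ] && echo "all files downloaded"
```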
6) Obtaining a Digital Object Identifier (DOI) for your dataset
Datasets in CONP require a unique and permanent Digital Object Identifier (DOI) to make them citeable and retrievable. To get a DOI for your dataset, follow these steps:
Log in to Zenodo, preferably using your GitHub account.
Select the new GitHub dataset repository in the list of GitHub repositories in Zenodo.
Release the new dataset on GitHub (see instructions here). Zenodo will then automatically create a DOI and an archive of the new dataset.
Get the Concept DOI badge from the Zenodo list of GitHub repositories here. Add that DOI to your dataset's README.md file and add its value to the identifier field of your DATS.json file. This DOI will always link to the latest release of the dataset.
7) Adding the new dataset to the list of CONP datasets
To add the newly created dataset to the list of CONP datasets present in the DataLad conp-dataset super dataset, you will need to submit a pull request to https://github.com/CONP-PCNO/conp-dataset. CircleCI will automatically test the newly added dataset to confirm that its files download correctly, validate the format of the DATS.json file, etc.
Procedure to follow to add the new dataset:
On GitHub, fork a new copy of https://github.com/CONP-PCNO/conp-dataset into your userspace.
Install that copy locally:
datalad install https://github.com/<your_user_name>/conp-dataset.git
Install the new dataset as a submodule under conp-dataset/projects:
cd conp-dataset/projects
git submodule add <new_dataset_github_repository>
<new_dataset_github_repository> will be of the form
Save the changes and push the branch to your fork:
datalad save -m '<message>'
datalad publish --to origin
Send a pull request (PR) from your fork's master branch to https://github.com/CONP-PCNO/conp-dataset. You should see two file changes in the PR:
A modification to the .gitmodules file to add the information for the new dataset. The added information should be of the form:
```
[submodule "projects/<new_dataset_name>"]
	path = projects/<new_dataset_name>
	url = https://github.com/<username>/<new_dataset_name>.git
```
Note: ensure that there is an empty line at the end of the `.gitmodules` file; otherwise it will not pass the format-checking tests of your PR.
A link to the latest commit of the <new_dataset_name> GitHub repository.
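The format requirement on the .gitmodules file can be anticipated locally before opening the PR. The sketch below reads "an empty line at the end" as "the file ends with a newline character", which is an assumption about what the PR tests enforce. The function name is hypothetical.

```shell
# Hypothetical helper: true when a non-empty file ends with a newline.
# Command substitution strips a trailing newline, so the last byte
# expands to an empty string when it is a newline.
ends_with_newline() {
    [ -s "$1" ] && [ -z "$(tail -c 1 "$1")" ]
}

# Usage, from the root of your conp-dataset fork:
#   ends_with_newline .gitmodules || echo >> .gitmodules
```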