2  Raw data

2.1 What is raw data?

In bioinformatics, raw data are the unprocessed output of a specific technology. They are data that you cannot regenerate (or not easily) and that should never be modified. The file types vary, but a good example for sequencing technologies is .fastq files.

Sometimes, the sequencing platform will also give you pre-analyzed or fully analyzed data. An example in the case of single-cell RNA-seq is cell-by-gene matrices. If they are the basis of your analysis, you should treat them just like raw data.

These files are required for publication.

2.2 Where to save raw data?

At the IGBMC, there are two storage infrastructures for scientific data. Their usage is described here, but as a reminder:

Storage   Usage                                  Replication   Daily snapshots
mendel    Data analysis, active projects         Yes           Kept for 30 days
space2    Data conservation, finished projects   No            Kept for 3 days

So in practice, you should save your raw and analyzed data to a mendel space.

Once your analysis is finished and ready for publication, you should upload the data to a public database, after which you can keep them on space2.

Once your paper is published and the data are accessible online to everyone (e.g. on GEO), you can remove from space2 any file that is available online. However, do add a .txt file that specifies how the data can be retrieved (e.g. an accession code).
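
As an illustration, such a .txt file could be created from the command line as sketched below. Everything here is an assumption made for the example: the folder is a temporary stand-in for your project folder on space2, the file name DATA_LOCATION.txt is hypothetical, and GSEXXXXXX is a placeholder to replace with your real accession code.

```shell
set -eu

# Hypothetical stand-in for the project folder on space2.
project_dir="$(mktemp -d)"

# Placeholder: replace with the real accession code of your dataset.
accession="GSEXXXXXX"

# Write a short note explaining where the removed files can be retrieved.
cat > "$project_dir/DATA_LOCATION.txt" <<EOF
Files that are available online were removed from this folder on $(date +%Y-%m-%d).
They can be retrieved from GEO with the accession code: $accession
EOF

cat "$project_dir/DATA_LOCATION.txt"
```

Keeping this note next to any remaining files means that anyone opening the folder later knows where the missing data went.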

2.3 Steps to save raw data

The general steps for saving raw data correctly are:

  • Download the data and store them securely: Save them on a mendel space or on backed-up institutional storage. Do not keep your data only on your local computer: if it is broken, lost, or stolen, the data will be lost forever.

  • Verify their integrity: Along with your data, you should receive a way to verify that the data were not corrupted during download. For example, you might be given the MD5 checksums of the files. Always keep the file containing the original MD5 values together with the data.

Note

An MD5 checksum is a 32-character hexadecimal fingerprint of a file. If the file is not exactly the same as the original, recomputing the MD5 will give a different result, and you will know that your file was corrupted (i.e. modified, truncated…).

Tip

In R, for example, you can run tools::md5sum("/path/to/file") to get the MD5 of a file.

  • Do not touch them anymore: Make the files read-only. You can open the Properties of a file and set its permissions to Read only. Or, from the command line, you can prevent modification of a folder and its contents with chmod -R a-w /path/to/folder (the recursive option is an uppercase -R; a-w removes write permission for everyone while keeping folders browsable). This also includes not changing the names of the files! If you need the data in another folder, or under another name, you can create symbolic links with ln -s in bash, or a shortcut file.
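
The steps above can be sketched on the command line as follows. This is a minimal illustration under stated assumptions, not a prescribed procedure: it works in a temporary folder with a dummy file, the names sample1.fastq, checksums.md5 and control_rep1.fastq are hypothetical, and it relies on the md5sum tool from GNU coreutils (on macOS the equivalent tool is md5, which takes different options).

```shell
set -eu

# Hypothetical layout: a throwaway folder stands in for a project space.
workdir="$(mktemp -d)"
raw="$workdir/raw"
mkdir -p "$raw" "$workdir/analysis"

# Stand-ins for a downloaded file and the checksum file provided with it.
printf '@read1\nACGT\n+\nIIII\n' > "$raw/sample1.fastq"
(cd "$raw" && md5sum sample1.fastq > checksums.md5)

# 1. Verify integrity: md5sum -c recomputes each checksum and compares it
#    with the recorded value; it exits with an error on any mismatch.
(cd "$raw" && md5sum -c checksums.md5)

# 2. Make the raw data read-only: remove write permission for everyone,
#    recursively. Folders keep their execute bit, so they stay browsable.
chmod -R a-w "$raw"

# 3. Need another name or location? Create a symbolic link, not a copy.
ln -s "$raw/sample1.fastq" "$workdir/analysis/control_rep1.fastq"
```

For an intact file, the verification step prints the file name followed by OK. Note that once the folder is read-only, you need to restore write permission (chmod -R u+w) before you can delete it.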

2.4 What about metadata?

Data are a collection of raw, unorganized information. Without proper processing, organization, and accompanying information, they cannot be interpreted. Metadata are data about data: they describe the data and give them context. Examples include the species, tissue type, or conditions of the samples, as well as the lab protocol, sequencing technology, etc.

If you filled in GenomEast’s LIMS correctly, you already have some of that information. You can save the project page of the study as a PDF and put it in the folder that contains the data. It contains general information about the study (ID, title, technology, date, collaborators, aim, description of samples…). You can also fill in and export the Samples conditions table in .csv format.

A metadata file for your samples can be in Excel, .csv, .tsv, or .txt format, or any format that can easily be read by a programming language in later analysis. What is important is that:

  • each sample has a unique identifier
  • each variable corresponds to one column
  • the values within a column are consistent

For example:

,Condition_treatment
Sample1,Control_6
Sample2,TreatmentA_6
Sample3,TreatmentB_6
Sample4,ctrl_9
Sample5,TreatmentA_6
Sample6,TreatmentB_6

is bad! There are two variables in the same column (Condition_treatment should be separated into Condition and Age), and the values in this column are not consistent (Control is not the same as ctrl).

,Condition,Age
Sample1,Control,6
Sample2,TreatmentA,6
Sample3,TreatmentB,6
Sample4,Control,9
Sample5,TreatmentA,9
Sample6,TreatmentB,9

is good.
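
The rules above can be checked quickly from the command line before starting an analysis. The sketch below is only an illustration and makes several assumptions: the file is comma-separated with a header row, the sample identifiers are in the first column, and no field contains a quoted comma. The function names check_unique_ids and list_values are hypothetical helpers written for this example.

```shell
set -eu

# Print every sample identifier (first column) that appears more than once.
# Prints nothing when all identifiers are unique.
check_unique_ids() {
    awk -F',' 'NR > 1 { seen[$1]++ }
               END { for (id in seen) if (seen[id] > 1) print id }' "$1"
}

# List the distinct values found in column number $2, with their counts,
# so that inconsistent spellings (e.g. Control and ctrl) stand out.
list_values() {
    awk -F',' -v col="$2" 'NR > 1 { n[$col]++ }
                           END { for (v in n) print v, n[v] }' "$1" | LC_ALL=C sort
}

# Example file: the "good" table from above, written to a temporary file.
metadata="$(mktemp)"
cat > "$metadata" <<'EOF'
,Condition,Age
Sample1,Control,6
Sample2,TreatmentA,6
Sample3,TreatmentB,6
Sample4,Control,9
Sample5,TreatmentA,9
Sample6,TreatmentB,9
EOF

check_unique_ids "$metadata"   # no output: all identifiers are unique
list_values "$metadata" 2      # distinct values of the Condition column
```

Run on the "bad" table, list_values would show Control and ctrl as two separate values, which is exactly the kind of inconsistency to fix before analysis.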