
For reproducible analysis code, it is important that the same data files are loaded every time. However, it is generally not useful to hard-code specific file names, since several files are analyzed and more files are often analyzed with the same method later on. Therefore, raw.findFiles is used to find files based on search criteria.

It can happen, however, that more data is added later on, so that raw.findFiles returns a different set of data files and the analysis results change. To pin the file set without listing specific file names, and to keep the code reusable, we can generate an MD5 checksum. Here is an example.

path.RAW = raw.getSamplePath()
myData = raw.findFiles(path.RAW, user='TG')
md5 = raw.getMD5str(myData)
print(paste("The MD5 checksum for all currently used files is:", md5))
#> [1] "The MD5 checksum for all currently used files is: 7545cd"

There are at least two scenarios where this is useful:

(1) If more files are added at a later point that would also match the same raw.findFiles search, we can restrict the search to exactly the same files as before.

(2) If a RAW filename is changed for some reason, for example to correct a date (2021 instead of 2011) or a sample name, the analysis is still based on the same data regardless of the filename.

path.RAW = raw.getSamplePath()
myData2 = raw.findFiles(path.RAW, user='TG', md5='7545cd')
all(myData == myData2)
#> [1] TRUE
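
To illustrate why a content-based checksum covers both scenarios, here is a minimal sketch in base R. It is not the implementation of raw.getMD5str, which is not shown here: the helper combined_md5, the six-character truncation, and the temporary demo files are all assumptions made for illustration. The sketch hashes each file with tools::md5sum, sorts the per-file hashes, and hashes that list again, so the result depends only on file contents.

```r
# Hypothetical helper (not part of the package): combine per-file MD5 sums
# into one short checksum that depends only on the file contents.
combined_md5 <- function(files, n = 6) {
  per_file <- unname(tools::md5sum(files))  # one MD5 per file, file names dropped
  tmp <- tempfile()
  on.exit(unlink(tmp))
  writeLines(sort(per_file), tmp)           # sort so the file order does not matter
  substr(unname(tools::md5sum(tmp)), 1, n)  # hash the list of hashes, keep n characters
}

# Create two small demo files in a temporary directory
demo_dir <- tempfile("rawdemo")
dir.create(demo_dir)
f1 <- file.path(demo_dir, "sample_2011.raw")
f2 <- file.path(demo_dir, "sample_B.raw")
writeLines("content A", f1)
writeLines("content B", f2)

before <- combined_md5(c(f1, f2))

# Scenario 2: a file is renamed, its content stays the same
f1_new <- file.path(demo_dir, "sample_2021.raw")
file.rename(f1, f1_new)
after_rename <- combined_md5(c(f1_new, f2))

# Scenario 1: an additional file shows up in the search result
f3 <- file.path(demo_dir, "sample_C.raw")
writeLines("content C", f3)
after_add <- combined_md5(c(f1_new, f2, f3))

print(before == after_rename)  # renaming does not affect the checksum
print(before == after_add)     # an added file is expected to change it
```

In this sketch the rename leaves the checksum untouched because file names never enter the hash, while an extra file adds a new entry to the hashed list. Truncating to a few characters keeps the checksum readable in code, at the cost of a small chance of collisions.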