ModelCrawler

ModelCrawler Configuration

All configuration parameters for the ModelCrawler are stored in a JSON file. A basic default config can be generated by executing:

java -jar target/ModelCrawler-0.2.2-jar-with-dependencies.jar -c modelcrawler.json --template

Or you can use this example configuration:

{
  "workingDir" : "/tmp/modelcrawler",
  "encoding" : "UTF-8",
  "pathSeparator" : "/", 
  "extensionBlacklist" : [ "png", "bmp", "jpg", "jpeg", "html", "xhtml", "svg", "pdf", "json", "pl", "rdf", "rar", "msh", "zip", "htm" ],
  "tempDirPrefix" : "ModelCrawler",
  "workingDirConfig" : "config.json",
  "urnNamespace" : "model",
  "morreUrl" : "http://localhost:7474/morre/",
  "databases" : [ {
    "type" : "BMDB",
    "enabled" : true,
    "workingDir" : "wd-biomodels",
    "limit" : -1, 
    "ftpUrl" : "ftp://ftp.ebi.ac.uk/pub/databases/biomodels/releases/"
  }, { 
    "type" : "PMR2",
    "enabled" : true,
    "workingDir" : "wd-cellml",
    "limit" : -1, 
    "hashAlgo" : "MD5",
    "repoListUrl" : "http://models.cellml.org/workspace_list_txt"
  } ], 
  "storage" : {
    "type" : "file",
    "httpAccessPath" : "http://localhost/models/",
    "baseDir" : "/var/www/models"
  }
}

Common Settings

workingDir

Path to a directory the ModelCrawler uses to cache data, e.g. cloned git repositories in the case of PMR2. Persisting this directory between runs is not mandatory, but it speeds up the crawling process dramatically. The exception is the --no-morre mode, where the cached information is required to avoid duplicating model versions.

tempDirPrefix

Prefix used to create new temporary directories in the system's default temp path (see tempDir).

tempDir

Path to a directory used as the temporary directory. This directory does not need to persist between runs of the ModelCrawler. If not set, a temporary directory is created in the system's default temp path, named with tempDirPrefix followed by a random character string.
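The fallback described above can be sketched with the standard library. This is a minimal illustration, not the ModelCrawler's actual code; the class and method names (TempDirSketch, resolveTempDir) are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class TempDirSketch {
    // tempDir is null or empty when the setting is absent from the config (assumption).
    static Path resolveTempDir(String tempDir, String tempDirPrefix) {
        try {
            if (tempDir != null && !tempDir.isEmpty()) {
                // Explicitly configured: create the directory if it does not exist yet.
                return Files.createDirectories(Paths.get(tempDir));
            }
            // Not configured: a new directory in the system's default temp path,
            // named with the prefix followed by a random suffix.
            return Files.createTempDirectory(tempDirPrefix);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling resolveTempDir(null, "ModelCrawler") would yield a fresh directory whose name starts with ModelCrawler.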

pathSeparator

The character used for path separation. Set it to \ (backslash) on Windows machines; otherwise leave it as / (slash).

encoding

The default encoding used. It is strongly recommended to set this to UTF-8.

extensionBlacklist

List of file extensions the ModelCrawler ignores without checking whether the file could be parsed as a model file.
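A blacklist check of this kind might look as follows. This is a sketch under assumptions: the names are hypothetical, and the case-insensitive comparison is a choice made here, not documented behaviour of the ModelCrawler.

```java
import java.util.Locale;
import java.util.Set;

class ExtensionFilterSketch {
    // Expects a bare file name (no directory part) and a blacklist of lower-case extensions.
    static boolean isBlacklisted(String fileName, Set<String> blacklist) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return false; // no extension, so the file cannot be skipped by extension
        }
        // Assumption: extensions are compared case-insensitively.
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return blacklist.contains(ext);
    }
}
```

With the example configuration, a file such as figure.png would be skipped, while model.xml would still be handed to the parsers.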

workingDirConfig

Name of the config file stored in the working directory. This file contains references to cached information, which speeds up the crawling process.

urnNamespace

Namespace for the URNs that are used as modelID. Defaults to model.

morreUrl

URL to Morre. It must not include any endpoint-specific URL parts.

databases

List of database implementations, cf. BioModels Database and PMR2.

storage

Definition of the target storage, cf. File Storage.

BioModels Database Settings

Configuration for crawling the BioModels Database.

type

Must be set to BMDB.

enabled

Must be set to true, otherwise this database won’t be crawled.

workingDir

Subdirectory in the common working directory for this database.

limit

Limits the number of releases downloaded during one execution of the ModelCrawler. Useful for reducing the maximum runtime. If set to -1, no limit is applied.
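The limit semantics can be illustrated with a small helper. This is only a sketch: the names are hypothetical, and both the order in which releases are picked and the treatment of every negative value as "no limit" are assumptions.

```java
import java.util.List;

class LimitSketch {
    // Returns at most `limit` candidates; a negative limit (e.g. -1) disables the cap.
    static <T> List<T> applyLimit(List<T> candidates, int limit) {
        if (limit < 0 || limit >= candidates.size()) {
            return candidates;
        }
        // Assumption: the first `limit` entries of the list are the ones processed.
        return candidates.subList(0, limit);
    }
}
```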

ftpUrl

URL to the FTP directory containing all releases of BioModels Database, e.g. ftp://ftp.ebi.ac.uk/pub/databases/biomodels/releases/. Authentication is currently not supported; only anonymous login is possible.

PMR2 Settings

Configuration for crawling PMR2/the CellML model repository.

type

Must be set to PMR2.

enabled

Must be set to true, otherwise this database won’t be crawled.

workingDir

Subdirectory in the common working directory for this database.

limit

Limits the number of releases downloaded during one execution of the ModelCrawler. Useful for reducing the maximum runtime. If set to -1, no limit is applied.

hashAlgo

Hash algorithm used to generate working directory names from a repository URL. The value must be a valid algorithm name for Java’s MessageDigest. Defaults to MD5.
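To illustrate how a directory name can be derived from a repository URL with MessageDigest, here is a minimal sketch. The class and method names are hypothetical, and the hex encoding and UTF-8 byte conversion are assumptions; the ModelCrawler may encode the digest differently.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class RepoDirNameSketch {
    // hashAlgo is passed straight to MessageDigest.getInstance, e.g. "MD5" or "SHA-256".
    static String dirNameFor(String repoUrl, String hashAlgo) {
        try {
            MessageDigest md = MessageDigest.getInstance(hashAlgo);
            byte[] digest = md.digest(repoUrl.getBytes(StandardCharsets.UTF_8));
            // Assumption: the digest is rendered as lower-case hex.
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // An invalid hashAlgo value ends up here.
            throw new IllegalArgumentException("Unknown hashAlgo: " + hashAlgo, e);
        }
    }
}
```

Since the same URL always maps to the same name, a cached clone can be found again on the next run, provided hashAlgo has not been changed in the meantime.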

repoListUrl

URL to a txt file listing all publicly available repositories. This is a legacy setting; please use collectionEndpoint instead. Defaults to http://models.cellml.org/workspace_list_txt

collectionEndpoint

URL to the JSON/vnd endpoint listing all publicly available exposures, e.g. http://models.cellml.org/exposure. This setting is preferred over repoListUrl.

File Storage Settings

Configuration parameters for using the filesystem as the target storage for the crawled models.

type

Must be set to file.

baseDir

Directory in which the hierarchical folder structure will be created and the model files will be written.

httpAccessPath

URL pointing to an HTTP-accessible resource that serves the files and folders below baseDir. It is used to create the URLs that are sent to Morre for import, so this URL (and its sub-URLs) must be accessible to Morre as well. Furthermore, Morre does not support HTTPS, so any policy enforcing transport encryption will not work.
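The relationship between httpAccessPath and a stored file can be sketched as simple string joining. The names below are hypothetical, the relative path layout is an assumption, and no URL encoding is performed in this sketch.

```java
import java.util.Locale;

class StorageUrlSketch {
    // Joins httpAccessPath and a path relative to baseDir with exactly one slash between them.
    static String publicUrl(String httpAccessPath, String relativePath) {
        String base = httpAccessPath.endsWith("/") ? httpAccessPath : httpAccessPath + "/";
        String rel = relativePath.startsWith("/") ? relativePath.substring(1) : relativePath;
        return base + rel;
    }

    // Simple guard reflecting the note above: HTTPS URLs cannot be fetched by Morre.
    static boolean isPlainHttp(String url) {
        return url.toLowerCase(Locale.ROOT).startsWith("http://");
    }
}
```

For example, with httpAccessPath set to http://localhost/models/, a file written to a/b.xml below baseDir would be announced as http://localhost/models/a/b.xml.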