====== TSD-P33 meeting minutes ======


===== Planned: Next meeting ...... at USIT =====

USIT, room 3212, Gaustadalléen 23 A

===== 23.10.2019 at USIT =====

  * how to use dedicated colossus queue \\ https://www.uio.no/english/services/it/research/sensitive-data/use-tsd/hpc/dedicated-resources.html
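
A minimal sketch of a job script for the dedicated queue, built around the ''#SBATCH --reservation=norment'' line recorded in the 06.03.2019 minutes. Only the reservation name comes from the minutes; the job name, account, time limit and memory value are assumptions for illustration, and the page linked above remains the authoritative reference.

```shell
#!/bin/bash
# Sketch of a job script for the dedicated nodes.
# Only --reservation=norment comes from the minutes; all other values are assumptions.
#SBATCH --job-name=dedicated-test
#SBATCH --account=p33
#SBATCH --reservation=norment
#SBATCH --time=00:10:00
#SBATCH --mem-per-cpu=1G

# Slurm sets SLURM_JOB_ID inside a job; fall back to a marker when run by hand.
job_id="${SLURM_JOB_ID:-not-in-slurm}"
echo "job ${job_id} running on $(uname -n)"
```

Such a script would typically be submitted with ''sbatch'' and then followed with ''squeue''.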

===== 11.09.2019 at USIT =====

Minutes and to-dos:

  * Gard gave a status update: the contract for the new storage will be signed within days. More information will follow once it is public.
  * New 10GbE export computers are racked and will be set up as soon as possible for export from /cluster to backup and from /cluster to VMs. This will hopefully remove the bottleneck when moving data from /cluster to /data/durable.
  * Linux machines should be replaced with RHEL7 as soon as possible. P33 has two RHEL6 machines with lots of memory that are unused because they lack /cluster and have software incompatibilities. P33, P22 and P19 are therefore eager to test RHEL7 + Horizon View.
  * Singularity V3 is needed on all VMs, not only the submit hosts. Singularity V3 might have issues on RHEL6, so we need to move to RHEL7 as the standard for Linux machines.
  * Azab to fix, with new VMs first if needed: create the new ones, then delete the old ones once users have agreed.
  * GPU nodes for P23, possible to make them available in the general P23 queue without “giving it all away to non-GPU-users”
  * NB: There are GPUs (V100) for general use in Colossus as well. Please try them, and tell us if DeepVariant is not installed.
  * Dragen: missing R packages for the CNV calling, and an issue on $scratch. Azab to fix the packages and to investigate $scratch vs. IB $scratch.
  * Issue with incoming large files over a slow (but steady) network connection from the US (UK Biobank). P33 and Leon must look at this.
  * There is a need for a simpler system towards Sigma2.
  * A plan is needed for the disk-upgrade procedure.
  * We will look into moving data for p22 from durable2 to durable3 and adjust the backup accordingly. Please give us the exact path of what is to be moved and where it is supposed to end up.
  * General recommendation : \\ For long-running jobs on VMs it is definitely recommended to run some kind of terminal multiplexer like screen or tmux, and to have the terminal multiplexer running on the system where the job is running, so you're not vulnerable to network issues/instability.
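
The recommendation above can be sketched with ''tmux'' as follows. The session name and the job command are hypothetical examples; ''screen'' works along the same lines.

```shell
#!/bin/bash
# Sketch: session name and job command are hypothetical examples.

# Start a command in a detached tmux session. Run this on the machine
# where the job itself runs, so the session survives a dropped connection.
start_detached() {
    local session="$1"
    local command_line="$2"
    tmux new-session -d -s "$session" "$command_line"
}

# Reattach to the session later (detach again with Ctrl-b d).
reattach() {
    tmux attach-session -t "$1"
}

# Example:
#   start_detached longjob './run_analysis.sh > run.log 2>&1'
#   reattach longjob
```

The command is passed as a single string so that redirections are evaluated inside the tmux session rather than in the calling shell.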

topics:
  * Status of RHEL7 login nodes?
  * ''p33-rhel7-hpc'' upgrade to 64GB RAM? - ticket was closed
  * ''p33-serviceuser'' - transfer ownership of files
  * ownership of base dir - prevent other users from creating files (''/cluster/projects/p33'', ''/tsd/p33/data/durable'')
  * 'research data archive' inside TSD (move data to cheap tape storage when a paper is published?)
  * ''qsumm'' only shows p33 - does not show running jobs from squeue
  * Start recompiling RHEL6 software ahead of the RHEL7 upgrade
  * Sigma2 application coming up
  * Start uploading to the Hitachi disk
  * Do we need direct upload to cluster? 
  * HubNode application in progress
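
For the base-directory item above, one common POSIX approach is to remove write permission for group and others on the top-level directory only, so that only its owner can create entries there while existing subdirectories stay writable. This is a sketch under that assumption, not the TSD procedure; whether it fits ''/cluster/projects/p33'' depends on who owns the directory and on how the file system handles permissions.

```shell
#!/bin/bash
# Sketch, not the TSD procedure: remove write permission for group and others
# on one directory only (no -R), so existing subdirectories keep their modes.
lock_top_level() {
    chmod g-w,o-w "$1"
}

# Example (path from the agenda item; must be run by the directory owner or root):
#   lock_top_level /cluster/projects/p33
```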

TSD plans:
  * NFS export (file server) - TSD will install new NFS export machines (one for backup + one for VMs)
  * possible to buy new storage now
  * new storage which will replace HNAS and BeeGFS - around Christmas (early adopters / testers needed)
  * installing new software - EasyBuild (TSD staff will be able to install software faster)
  * service for projects to create new users
  * ThinLinc will be replaced by VMware
  * RHEL7 login nodes - test VMware first
  * Singularity recipes to make user interfaces run


===== 19.06.2019 at USIT =====

TSD plans: 
  * file server / BeeGFS update in December
  * self service for user account creation to be released end of summer
  * 2 GPU nodes will be made available
  * instant clone of login VMs
  * singularity upgrade (1.2 -> 1.3) not possible yet due to security issues
requests:
  * file import directly to ''/cluster'' 
    * can be implemented in August if urgent 
    * otherwise will be fixed with upgrade of BeeGFS nodes in December
  * dedicated user needs to be tested by p33

===== 06.03.2019 =====

Dag, Florian, Francesco, Gard, Leon, Ole, Thomas (Dennis and Tobias via Skype)

  * BeeGFS works on Windows again
  * communication: use the p33 mailing list and cc everyone on relevant tickets
  * ''/cluster''/beegfs disk became more stable (no crash in 2.5 weeks) + faster response times
  * dedicated compute nodes on TSD 
    * 1.5 nodes are in production
    * we will get info on how to use nice values in our own queue (to manage job priorities)
    * if we want to buy more: 900,000 NOK for 256 cores (4 nodes with 64 cores each)
    * how to use dedicated nodes 
      * qsumm - dedicated resources
      * ''#SBATCH --reservation=norment''
      * more info: https://www.uio.no/english/services/it/research/sensitive-data/use-tsd/hpc/dedicated-resources.html
  * status /cluster : Better response times, no crash in 2.5 weeks
  * advanced user support
  * quota on /cluster was increased last month by ~200TB
  * GPUs on TSD are available:
    * 2 NVIDIA accelerators for LCBC and 4 in the public queue
  * manual rhel7 login machines work - a release is being worked on
  * next meeting: 19.6. wed 14.00 at usit


===== 06.03.2019 =====

Alex, Dag, Florian, Francesco, Gard, Thomas

topics:
  * SLURM: increase slurm max jobs for genotype imputation (discussed on 29.11.2017)
    * wait until the Colossus 3 cluster is in place (within this month)
      * /cluster and HPC downtime for about one week - p33 users will be notified via mail
      * p33 will get its own queue (96 cores / 1.5 nodes)
      * will have new parallel disk - might have new mount point
  * self-hosted services / data ownership: dedicated user account possible?
    * Leon will look into this
  * data ownership: how to manage data to make IRB happy? timeline? March 15th is the MoBa meeting
    * create sub-project - E. Hovig has similar setup
    * need to write mail regarding setup directly to Haneef to meet deadline
  * overview of our disks and status of ordered disks
    * 2x96 TB (2017) are currently in place
    * 1x96 TB (Aug. 2018) was lent to E. Hovig
    * ~200 TB (Dec. 2018) will be bought for colossus 3 cluster
  * eConsent: How to start process? What needs to be in place on p33?
    * need to put text into nettskjema
    * will be stored by TSD
    * accessible via web interface inside TSD
  * import MRI data:
    * data has finished copying to taxi-disk
    * will now be moved into TSD
  * backup: path to /cluster backups?
    * /cluster is put on tapes. One needs to send a mail for recovery
    * backup covers last 7 versions for up to 90 days
  * data ownership: how to remove files left by former users?
    * will be discussed
  * data ownership: make top-level directories read-only
    * ask Bjørn-Helge
  * file-system: current shell location lost (interactive and non-interactive shell)
    * won't change for /cluster - problem of parallel file-system
  * software: preinstall all(?) R packages, because they are hard to install and depend on system libraries
    * ask Bjørn-Helge
  * file-system: tmp dir on /cluster (without naming nobackup; how to purge - without breaking services; lsof?; only dated files)
    * will be discussed
  * self-hosted webservices: DNS and webserver seem to exist for password change. Can be reused?
    * is being looked into
  * Next meeting?

===== 28.11.2018 =====

Dag, Florian, Francesco, Gard, Steffen, Thomas, Torbjørn, Ole

  * how to spend 2x200k:
    * Gard will make offer for compute and storage (colossus 2.0)
  * self-hosted services / data ownership: dedicated user account possible?
    * will be checked
  * windows VM: possible to auto mount network disks system-wide? (difficult/unintuitive for new users)
    * will be checked. 
  * SLURM: p33 queuing. Problem: p33 users spamming queue with long-lasting jobs -> urgent jobs end up with days of waiting time. 
    * with colossus 2.0, p33 will buy dedicated compute nodes with own queue(s) and own access control.
  * weblogin: change 'passcode' to 'one-time password' in view.tsd.usit.no/portal/webclient/index.html
    * problem has been reported to vmware. 
  * Windows VM: Allow users to run any binaries?
    * unlikely to happen. too much malware for windows (bitlocker, etc). 
  * TSD support: fast queue for ticket system?
    * selected p33 users can call Haneef, Leon, and Gard. In given order.  
  * import/export: p01-fileapi ownership was fixed
  * module avail takes long time:
    * will be checked
  * import EEG data: 
    * ~400 samples with 2GB each
  * import MRI data: 
    * Plan: start with raw data and use disk-taxi before xmas
  * can't load modules on TSD, 'module avail' shows empty list (fixed already!)

===== 12.09.2018 =====

Alexander, Benjamin, Florian, Francesco, Gard, Haneef, Steffen, Thomas, Ole


  * self-hosted webservices: dedicated user possible?
    * Seems possible. Will be checked.
  * self-hosted webservices: Is network traffic hidden from other projects (safe to use)?
    * Yes. Network traffic is separated from other projects on TSD.
  * self-hosted webservices: intranet entry page possible? (dns redirect / http redirect / captive portal)
    * Seems possible. Will be checked.
  * Windows VM: status Windows client for BeeGFS
    * Is being tested internally at TSD.
  * backup service: span more time? currently one week; maybe for special folders
    * Backup folder spans 7 days, but backups exist for 90+ days (on tape)
    * Write ticket to have something restored. 
    * Exact time backups are kept: https://www.uio.no/tjenester/it/maskin/filer/backup/
  * import/export: no owner visible for files in file-api folder; p01-fileapi user
    * Current owner (p01-fileapi) would be difficult to change
    * Idea: place file in ''file-api/<username>/'' - Seems possible. Will be checked.
  * SLURM: available resources / expected waiting time?
    * All tools for monitoring jobs exist
    * Use ''/cluster/bin/qsumm'' and ''squeue'' to check cluster usage
  * import/export: Has new file API been tested by p33?
    * Yes. Looks good.
  * import/export: move MRI data to TSD
    * MRI data is currently sitting on disks at NORMENT. Disk taxi has been prepared but uploading using new file-API seems to make more sense. 
  * import/export: Ability of labs to upload files to TSD (instead of delivering directly to NORMENT)?
    * Is possible. Anonymous upload via S3 file API
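
For the waiting-time question above, ''squeue'' can report the scheduler's estimated start time for pending jobs. A small sketch using standard ''squeue'' options (''-u'', ''-t'', ''--start''); the function names are made up for illustration, and the estimates depend on the time limits of the jobs ahead in the queue.

```shell
#!/bin/bash
# Sketch using standard squeue options; function names are hypothetical.

# All of the current user's jobs, in any state.
my_jobs() {
    squeue -u "$USER"
}

# Pending jobs only, with the scheduler's estimated start time.
my_expected_start() {
    squeue -u "$USER" -t PENDING --start
}
```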

===== 13.06.2018 =====

Alexander, Florian, Francesco, Gard, Leon, Steffen, Thomas

  * File import/export:
    * New file API is in place 
    * New web interface is at https://data.tsd.usit.no/ (not official yet)
  * Windows VM: Mounting /cluster on Windows login VM
    * Not ready yet - Software is in beta
  * Disk space: 
    * Another 96 TB disk will be ordered for BeeGFS/cluster (third one) (p33 currently consumes > 100 TB)
  * TSD performance:
    * More VMs should be ordered when experiencing performance bottlenecks
  * TSD access: new web interface to register TSD users (self-service)
    * User registers with Norwegian ID
    * Admin (Thomas Bjella) assigns user to P33 project
    * Web page will provide QR code and password (no mail anymore)
  * Linux VM: default OS R version was updated (3.4.4 -> 3.5.0) which breaks all installed packages
    * Write ticket with CC Gard and Leon
  * Docker/Singularity: TSD has increased support for containerization
    * Users should prefer Singularity over Docker (avoids super-user privileges)
    * https://www.uio.no/english/services/it/research/hpc/abel/help/software/singularity.html
  * Software: Possible to have Code management / GitHub enterprise on TSD?
    * Some code management system is needed - maybe in the future.
  * Software: install user-friendly GUI to manage databases
    * pgadmin is already available 
    * TSD will look into Microsoft SQL Server Management Studio
  * Software: Possible to install software similar to SharePoint (for collaborative work on documents)?
    * Open Atrium is available on TSD
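
For the Singularity recommendation above, running a command inside an image needs no super-user privileges. A minimal sketch, assuming an image file that has already been imported into the project; the image path and the command are hypothetical examples.

```shell
#!/bin/bash
# Sketch: the image path and the command are hypothetical examples.

# Run a command inside a Singularity image without root privileges.
run_in_image() {
    local image="$1"
    shift
    singularity exec "$image" "$@"
}

# Example:
#   run_in_image /cluster/projects/p33/images/r-3.5.0.simg Rscript analysis.R
```

Keeping software such as a specific R version inside an image also makes it independent of updates to the OS packages on the VM.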


===== 29.04.2018 =====


Florian, Francesco, Gard, Ole, Thomas

  * Windows VM: Mounting /cluster on Windows login VM
    * will be available soon
  * Linux VM: BeeGFS file permissions are changed from group read-only to rw
    * Andreas is contact person
  * Linux VM: load on Linux VM
    * TSD generally tries to identify IO bottlenecks
    * possible to get more machines
  * import/export: low disk space on login node + API status
    * new 36 hours deletion time
    * new API is available
    * drag and drop will be available before summer
  * Linux VM: possible to clone the RPM state between Linux VMs? Might save time instead of manually installing missing software packages
    * will be investigated
  * Next meeting: 
    * Wed, 13.06.2018, 15.00



===== 29.11.2017 =====

Alexander, Florian, Francesca, Francesco, Gard, Leon, Ole, Thomas

  * Login Node: 
    * Big thanks to TSD for upgrading the linux VM to 128 GB memory. For now this solves lack of interactive colossus nodes. Another 256 GB VM is possible if really needed.
  * Login Node:
    * Second login node is in place (hostname: p33-tl02) - ssh from p33-tl01 available.
  * Disk space: Status of 2x 96 TB disks?
    * New disks are now physically installed. P33 disk quota will be increased soon. 
  * SLURM: Increase maximum amount of simultaneously queued jobs? (currently 400 per user)
    * Can be increased. 400 is default limit.
  * SLURM: Show status of queuing system? (currently only info for own jobs is visible)
    * Will be investigated.
  * Import files: How to import / sync large data sources into TSD?
    * New system with streaming upload and resumable upload is being worked on. P33 will send a list with use cases
  * Import files: slow NIRD -> TSD upload
    * P33 to send stacktrace from NIRD -> TSD upload (low throughput 2 MB/sec, unless ssh tunnel through abel)
  * Import files: Filelock disk capacity
    * Filelock is not cleaned up: it was constantly at 1.3 TB of free space over the last two months
  * File permissions: Possibility of a p33 administrator role being able to lock and clean directories? (currently files of other p33 users can not be moved or removed)
    * Will be investigated. 
  * Next Meeting? 
    * 24.01.2018, 15.00 