Earlier this week we upgraded Slurm (our scheduling software) to try to fix several bugs that had been identified over the past few weeks while working with our support vendor. This upgrade addressed some of the reported problems but also introduced a few configuration issues that affected users' ability to run jobs earlier in the week. In addition to resolving these new problems, we have been working to identify and address the failed-job and node issues that users have been experiencing and that were not fixed by the upgrade.
We understand that these issues have impacted your ability to conduct research on the cluster, and we are working diligently to resolve them. To help expedite this process, we have engaged an external consulting firm as well as a team of HPC experts from Dell to work closely with our staff to bring the cluster back to full capacity.
Over the last day, as we have made adjustments, we have seen job completion and job reliability rates on Taki improve. We believe this is a good indication that we are getting closer to identifying some of the root causes of these issues, but we know we aren't there yet.
We appreciate the patience and continued communication from all of our users. If you continue to experience issues, please open tickets with as much detail as possible so we can continue to diagnose and triage them. The information we get from your tickets is essential to resolving these issues.
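As one example of the kind of detail that helps, if you have the ID of a failed job, including the output of Slurm's accounting query for that job (along with your submission script and any error output) gives us a good starting point:

    sacct -j <jobid> --format=JobID,State,ExitCode,Elapsed,NodeList

This is just a suggestion, not a requirement; any information about what you ran, when, and what you observed is useful.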
If you have any questions or concerns, please email me and I'll be happy to discuss these issues and the status of the cluster in more detail.
Thank you,
Damian
--
Damian Doyle
Assistant Vice President
DoIT - UMBC