2024-05-14
(05-15 9:30 Update) Listed the 183 job IDs. Added prevention of recurrence and future actions.
The storage failure described below has been resolved.
1. Date and time
Failure time: 2024/5/12 22:32:55
Recovery time: 2024/5/14 08:17:31
2. Details of the failure
The MDS (Metadata Server) of the Lustre file system that provides the /gs/fs and /home areas became unstable due to high load, and access was delayed or failed.
This caused the following symptoms: users were unable to log in to the login nodes, errors were displayed on the TSUBAME Portal, and compute nodes that detected the abnormality refused to accept new jobs.
3. Affected Jobs
A total of 1274 jobs were running during the failure and may have been affected by it. The following 183 jobs are considered to have been running in the /gs/fs and /home areas.
86002 86007 86931 86953 87027 87031 87082 87087 87102 87103 87217 87218 87724 87730 87744 87773 87781 87797 87803 87805 87816 87817 87936 87938 87940 87967 87980 87988 88011 88039 88052 88061 88062 88063 88064 88084 88102 88103 88105 88107 88127 88128 88129 88130 88131 88132 88133 88134 88135 88136 88137 88138 88139 88140 88141 88142 88143 88144 88145 88146 88147 88256 88260 88267 88268 88269 88270 88271 88273 88294 88295 88327 88328 88330 88331 88332 88333 88334 88366 88367 88381 88393 88394 88395 88400 88436 88437 88440 88441 88442 88447 88449 88452 88453 88470 88498 88499 88502 88506 88508 88513 88514 88522 88525 88526 88527 88538 88545 88547 88554 89567 89572 89573 89574 89575 89576 89577 89593 89599 89600 89602 89642 89644 89645 89646 89647 89648 89649 89650 89651 89659 89660 89661 89662 89663 89664 89665 89666 89667 89668 89669 89670 89671 89672 89673 89674 89675 89676 89677 89678 89679 89680 89681 89682 89683 89684 89685 89686 89687 89688 89689 89690 89691 89692 89693 89694 89695 89696 89697 89698 89699 89700 89701 89702 89703 89704 89705 89706 89716 89718 89719 89723 89730
4. Prevention of recurrence and future actions
・Emergency measures have been taken to avoid a bug that may have caused the high load.
・Points will be returned for metered-use jobs and reserved jobs.
・The MDS service is being deactivated and a failback will be performed. The work itself should take only a few minutes, but because I/O may hang during that time, it will be carried out only after a separate announcement.