[Mageia-sysadm] Jonund and Ecosse restarted...

Thomas Backlund tmb at mageia.org
Tue Aug 9 17:55:40 CEST 2011

Pascal Terjan wrote on 9.8.2011 16:37:
> On Fri, Aug 5, 2011 at 00:33, Thomas Backlund <tmb at mageia.org> wrote:
>> Hi,
>> Since both Jonund and Ecosse had dropped some of their build speed,
>> I checked them out.
>> both had zombie rpmbuild processes with the oldest dating about ~8 days ago,
>> slow disk io and Ecosse hit ATA Bus Reset errors.
>> So I restarted both to flush out the memory and re-init the disc
>> controllers. both are now running nicely again.
> ecosse is very very slow and looking at dmesg, it seems to have happened again
> ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
> ata3.00: failed command: FLUSH CACHE EXT
> ata3.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
>           res 40/00:00:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
> ata3.00: status: { DRDY }
> ata3: soft resetting link
> ata3.00: configured for UDMA/133
> ata3.00: retrying FLUSH 0xea Emask 0x4
> ata3: EH complete
> urpmi installs a few packages per minute only, spending most time in D
> state while nothing else is running on the machine


This one seems to be what starts the whole mess:
INFO: task jbd2/dm-0-8:633 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
jbd2/dm-0-8   D 0000000000000002     0   633      2 0x00000000
  ffff88022e3a7b40 0000000000000046 ffff88022e3a7b00 ffffffff8104d7ea
  0000000000015840 0000000000000282 ffff88022e3a7af0 000000006c462645
  ffff88022e3a7fd8 000000000000faf0 ffff88022f172da0 ffff88022dca16d0
Call Trace:
  [<ffffffff8104d7ea>] ? try_to_wake_up+0x2da/0x410
  [<ffffffff81084609>] ? ktime_get_ts+0xa9/0xe0
  [<ffffffff810e9680>] ? sync_page+0x0/0x50
  [<ffffffff813c3b33>] io_schedule+0x73/0xc0
  [<ffffffff810e96bd>] sync_page+0x3d/0x50
  [<ffffffff813c418f>] __wait_on_bit+0x5f/0x90
  [<ffffffff810e9843>] wait_on_page_bit+0x73/0x80
  [<ffffffff8107a1d0>] ? wake_bit_function+0x0/0x40
  [<ffffffff810f3bf5>] ? pagevec_lookup_tag+0x25/0x40
  [<ffffffff810e9c5d>] filemap_fdatawait_range+0x10d/0x1a0
  [<ffffffff810e9d1b>] filemap_fdatawait+0x2b/0x30
  [<ffffffffa0008df5>] jbd2_journal_commit_transaction+0x745/0x12f0 [jbd2]
  [<ffffffff8106b53b>] ? try_to_del_timer_sync+0x7b/0xe0
  [<ffffffffa000f193>] kjournald2+0xb3/0x200 [jbd2]
  [<ffffffff8107a190>] ? autoremove_wake_function+0x0/0x40
  [<ffffffffa000f0e0>] ? kjournald2+0x0/0x200 [jbd2]
  [<ffffffff81079c66>] kthread+0x96/0xa0
  [<ffffffff8100ae24>] kernel_thread_helper+0x4/0x10
  [<ffffffff81079bd0>] ? kthread+0x0/0xa0
  [<ffffffff8100ae20>] ? kernel_thread_helper+0x0/0x10
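
For spotting this pattern again without reading the whole log by eye, here is a minimal sketch that pulls hung-task warnings and libata "failed command" lines out of kernel log text. The function name scan_kernel_log and the two regular expressions are my own assumptions, written against the exact message formats shown in this thread; other kernel versions may word these messages differently.

```python
import re

# Hung-task warning as it appears in this thread, e.g.
# "INFO: task jbd2/dm-0-8:633 blocked for more than 120 seconds."
HUNG_TASK = re.compile(
    r"INFO: task (?P<comm>.+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)

# libata failed-command line as it appears in this thread, e.g.
# "ata3.00: failed command: FLUSH CACHE EXT"
ATA_FAILED = re.compile(
    r"(?P<port>ata\d+(?:\.\d+)?): failed command: (?P<cmd>.+)"
)


def scan_kernel_log(text):
    """Return (hung_tasks, ata_failures) found in kernel log text.

    hung_tasks   -- list of (task name, pid, seconds blocked)
    ata_failures -- list of (ata port, failed command)
    """
    hung_tasks, ata_failures = [], []
    for line in text.splitlines():
        m = HUNG_TASK.search(line)
        if m:
            hung_tasks.append(
                (m.group("comm"), int(m.group("pid")), int(m.group("secs")))
            )
            continue
        m = ATA_FAILED.search(line)
        if m:
            ata_failures.append((m.group("port"), m.group("cmd").strip()))
    return hung_tasks, ata_failures
```

Saving the output of dmesg to a file and passing its contents to scan_kernel_log() should be enough to see whether the jbd2 hang and the FLUSH CACHE EXT timeouts show up together.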

IIRC this was fixed in newer kernels, but since Mandriva (mdv) has still
not released a fixed kernel for 2010.1, I guess I'll either build a fixed
kernel myself for the Mageia build system (BS), or simply rebuild the
Mageia 1 kernel for it.

Another thing we could do for now is to run only one buildbot until the
kernel is fixed.
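
Until a fixed kernel is in place, the symptoms from this thread (zombie rpmbuild processes, tasks sitting in D state) can be watched for with something like the sketch below. It reads the process state from /proc/<pid>/stat; the function names parse_stat and stuck_processes are hypothetical helpers of mine, not part of any existing tool on the build nodes.

```python
import os


def parse_stat(stat_line):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or
    # parentheses, so split on the last ')' instead of on whitespace.
    left, right = stat_line.rsplit(")", 1)
    pid, comm = left.split("(", 1)
    return int(pid), comm, right.split()[0]


def stuck_processes(proc_root="/proc"):
    """List (pid, comm, state) for tasks in D (uninterruptible sleep)
    or Z (zombie) state."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a per-process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited or stat was unreadable
        if state in ("D", "Z"):
            found.append((pid, comm, state))
    return found
```

A task showing up briefly in D state is normal during disk I/O; it is only the ones that stay there across repeated runs that point at the problem above.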

