NAS4Free/FreeBSD 9.2 Stability Issues

Notes about how to get NAS4Free stable

NAS4Free is crashing unexpectedly after a period of approximately 1 day to 2 weeks. The crashes are far enough apart to make it difficult to determine the root cause.

Solution: Use FreeNAS.

 

Hardware

  • Supermicro X9DRH-7TF
    • a single E5-2609 CPU
    • X540 NICs (ixgbe)
    • LSI 2208 SAS controller
    • 32GB ECC RAM
  • 8 WD Red drives (ZFS data)
  • 2 Intel S3700 (ZFS ZIL)
  • 1 Intel S3500 (ZFS L2ARC/Cache)
  • IBM SAS M1015 controller in IT mode
  • APC UPS

 

Attempts to remedy stability

Try anything that will either

  • show any cause for the instability
  • show the root cause for the instability

 

Action Description
Upgrade NAS4Free
In the hope that the new kernel/software addresses the issue.

Kernel is now: FreeBSD 9.2-RELEASE-p3 #0 r260900M
Swap power supply
I've tried two reasonable power supplies. Both are 650W and provide 12 volt power for the CPU.
RAM Check
The RAM is ECC Registered. There are no log entries to indicate that there is a RAM issue. The RAM has been checked with a memory tester
Add serial console to capture messages
The serial console had what looked like a flow control issue where the output was overwriting itself. No useful output was captured.
Remote logging
All logging was sent to a remote syslog host. No relevant messages were captured.
Disconnect UPS
The network based UPS support was disabled to ensure it wasn't causing a reboot
Add ZFS swap
Add a ZFS based swap. If the kernel was having a short term memory issue then this might help it through.
Add vanilla swap
If ZFS itself was eating the memory, having swap on a ZFS volume could compound the issue. Add a couple of SATA disks and put them in a mirror with a 64G slice for swap.

I tried removing the swap; within 24 hours the machine had problems and the kernel killed the iSCSI target:
     kernel: pid 4113 (istgt), uid 0, was killed: out of swap space
Optimise memory
Run the ZFS Kernel Tune webGUI extension. The system was tending to only use half the 32G of RAM.
Upgrade IPMI firmware
 
Upgrade main BIOS

Remove all disks from the onboard 2208
The ZIL and L2ARC cache disks were connected to the LSI 2208 controller. If that driver and/or hardware is the issue then removing the disks might help. Move the SSD disks and the swap mirror to the onboard C602 SATA ports.

Removing the disks from the LSI 2208 SAS controller improved stability.

memtest86+
memtest86
Run memtest86+ and memtest86 again just to confirm there isn't a new (or existing) memory issue.
Disable the 2208
Change the onboard jumper to disable the 2208 chip

 

Network

Network connectivity is provided by two 10G ports running at 1G speed in a lagg with VLANs.

Action Description
ping host
Leave a ping running from a reliable directly connected host to provide a rough measure of packet loss.

This confirms the issue: pings to the NAS lose packets, whereas a ping to other hosts shows zero or low packet loss.
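To put a number on the loss, the percentage can be pulled out of the summary that ping prints when it exits. This is only a sketch: `ping_loss` is a made-up helper name, and `nas-host` in the comment is a placeholder, not a host from these notes.

```shell
#!/bin/sh
# Extract the packet-loss percentage from ping's closing summary line.
# Reads ping output on stdin and prints the bare number (e.g. "2.0").
ping_loss() {
    sed -n 's/.* \([0-9.][0-9.]*\)% packet loss.*/\1/p'
}

# Demo with a canned summary line, so no network is needed.
# On a real host (placeholder name): ping -c 100 nas-host | ping_loss
echo "100 packets transmitted, 98 packets received, 2.0% packet loss" | ping_loss
```

Running the same command against the NAS and against a known-good host gives two figures that can be compared directly.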
Try to understand what local_faults are
The NICs are showing some 'local faults'. It could indicate a cabling issue. In theory these stats are not valid as the NIC is running at 1Gb.


X540-T2 MAC stats


# sysctl dev.ix | grep fault
dev.ix.0.mac_stats.local_faults: 14
dev.ix.0.mac_stats.remote_faults: 0
dev.ix.1.mac_stats.local_faults: 4
dev.ix.1.mac_stats.remote_faults: 0
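A single reading of these counters says little; what matters is whether they keep growing. One way to check is to save the `sysctl dev.ix | grep fault` output twice, some time apart, and compare the two snapshots. The sketch below assumes that workflow. `fault_delta` is a hypothetical helper, and the "after" values in the demo are invented for illustration, not measured.

```shell
#!/bin/sh
# Compare two snapshots of "sysctl dev.ix | grep fault" and print every
# counter whose value changed, together with the size of the change.
# Usage: fault_delta BEFORE_FILE AFTER_FILE
fault_delta() {
    awk -F': ' '
        NR == FNR { before[$1] = $2; next }       # first file: remember values
        ($1 in before) && $2 != before[$1] {       # second file: report changes
            printf "%s: +%d\n", $1, $2 - before[$1]
        }
    ' "$1" "$2"
}

# Demo: the "before" values come from the output above; the "after"
# values are made up.
before=$(mktemp)
after=$(mktemp)
printf 'dev.ix.0.mac_stats.local_faults: 14\ndev.ix.1.mac_stats.local_faults: 4\n' > "$before"
printf 'dev.ix.0.mac_stats.local_faults: 17\ndev.ix.1.mac_stats.local_faults: 4\n' > "$after"
fault_delta "$before" "$after"
rm -f "$before" "$after"
```

If the counters only move when a cable is reseated or the link renegotiates, they are probably unrelated to the crashes.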

Ensure flow control is disabled
The switch has flow control disabled. TODO: find documentation as to what the bits of this register mean.

# sysctl dev.ix | grep fc
dev.ix.0.fc: 3
dev.ix.1.fc: 3

# sysctl dev.ix.0.fc=0
dev.ix.0.fc: 3 -> 0
# sysctl dev.ix.1.fc=0
dev.ix.1.fc: 3 -> 0
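As far as I can tell, the ixgbe driver treats the `fc` value as a flow-control mode number rather than a bit field: 0 is off, 1 is receive pause only, 2 is transmit pause only, and 3 is both. That matches the driver's flow-control mode enumeration as I understand it, but it should be checked against the README for the driver version in use (2.5.15 here). A small lookup, with `ix_fc_mode` as a hypothetical helper name:

```shell
#!/bin/sh
# Map a numeric dev.ix.N.fc value to a flow-control mode name.
# The mapping is an assumption based on the ixgbe driver's mode numbering.
ix_fc_mode() {
    case "$1" in
        0) echo "none" ;;
        1) echo "rx pause" ;;
        2) echo "tx pause" ;;
        3) echo "full" ;;
        *) echo "unknown" ;;
    esac
}

ix_fc_mode 3   # the value before the change above
ix_fc_mode 0   # the value after the change above
```

Under that reading, the change from 3 to 0 above turns off pause frames in both directions, which matches the switch setting.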


Disable lagg and VLANs.
This would take the ports out of use. Reading indicates other people have tried this and it isn't the root cause.
Increase buffers in hope of losing connectivity less often
 

sysctl kern.ipc.nmbclusters=512144
sysctl kern.ipc.nmbjumbop=512144
sysctl kern.ipc.nmbjumbo9=512144

Residuals

Known issues that probably don't impact stability (but are likely to impact performance).

Item Description
Raidz2 with non-optimal number of disks
Reading indicates that the number of disks should fit into a 128k stripe.

i.e. for a raidz2 virtual device with 'n' disks:
   128 / (n - 2) / sector-size
should be a whole number. Otherwise performance will suffer.

So eight 4k sector disks:
  128 / ( 8 - 2 ) / 4 = 5.33
Whereas six 4k sector disks:
  128 / (6 - 2) / 4 = 8
Listen backlog on apps seems low
While looking for possible issues the listen backlog for several applications seems low. The iSCSI target (istgt) has a very low listen backlog value.

This causes the iSCSI client to generate a large number of messages when it tries to connect all targets back up at once (as most of them get rejected due to exceeding the backlog threshold).
PCIe bus issues
The machine/CPU is getting PCIe bus errors.
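The raidz2 rule of thumb from the first item above can be checked for any disk count with integer arithmetic. This sketch only restates the formula given in these notes (128k split across the n - 2 data disks, in whole sectors). `raidz2_fit` is a hypothetical helper, not a ZFS tool.

```shell
#!/bin/sh
# Check whether a 128k stripe divides into whole sectors on each data disk
# of a raidz2 vdev.
# Usage: raidz2_fit DISKS SECTOR_KIB [STRIPE_KIB]
raidz2_fit() {
    disks=$1 sector=$2 stripe=${3:-128}
    data=$((disks - 2))                  # raidz2 gives up two disks to parity
    if [ "$data" -lt 1 ]; then
        echo "need at least 3 disks" >&2
        return 2
    fi
    if [ $((stripe % (data * sector))) -eq 0 ]; then
        echo "$disks disks: $((stripe / data / sector)) sectors per disk, fits"
    else
        echo "$disks disks: does not divide evenly"
    fi
}

raidz2_fit 8 4   # the eight-disk layout used here
raidz2_fit 6 4   # the six-disk comparison
```

The two calls reproduce the worked examples: eight disks do not divide evenly (5.33), six disks give 8 sectors per disk.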

Links

Appendices

Error message

kernel: sonewconn: pcb 0xfffffe01d7149ab8: Listen queue overflow: 2 already in queue awaiting acceptance
last message repeated 111 times
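Because syslog collapses repeats into "last message repeated N times", counting only the lines that mention the overflow understates how often it happens. A rough tally that also adds the repeat counts could look like the sketch below. `count_overflows` is a hypothetical helper, and it assumes a repeat line refers to the overflow message only when it follows one directly.

```shell
#!/bin/sh
# Count listen queue overflows in syslog text on stdin, including those
# folded into "last message repeated N times" lines.
count_overflows() {
    awk '
        /Listen queue overflow/ { total++; prev = 1; next }
        /last message repeated [0-9]+ times/ {
            if (prev) {                       # repeat of an overflow message
                for (i = 1; i <= NF; i++)
                    if ($i == "repeated") total += $(i + 1)
            }
            next
        }
        { prev = 0 }                          # any other line breaks the run
        END { print total + 0 }
    '
}

# Demo using the two log lines quoted above.
printf '%s\n' \
    'kernel: sonewconn: pcb 0xfffffe01d7149ab8: Listen queue overflow: 2 already in queue awaiting acceptance' \
    'last message repeated 111 times' | count_overflows
```

For the two lines above this gives 112 overflows rather than 1.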

PCI Info

mfi0@pci0:1:0:0:        class=0x010400 card=0x069015d9 chip=0x005b1000 rev=0x05 hdr=0x00
    vendor     = 'LSI Logic / Symbios Logic'
    device     = 'MegaRAID SAS 2208 [Thunderbolt]'
    class      = mass storage
    subclass   = RAID
mps0@pci0:3:0:0:        class=0x010700 card=0x30201000 chip=0x00721000 rev=0x03 hdr=0x00
    vendor     = 'LSI Logic / Symbios Logic'
    device     = 'SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon]'
    class      = mass storage
    subclass   = SAS
ix0@pci0:5:0:0: class=0x020000 card=0x152815d9 chip=0x15288086 rev=0x01 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = 'Ethernet Controller 10 Gigabit X540-AT2'
    class      = network
    subclass   = ethernet
ix1@pci0:5:0:1: class=0x020000 card=0x152815d9 chip=0x15288086 rev=0x01 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = 'Ethernet Controller 10 Gigabit X540-AT2'
    class      = network
    subclass   = ethernet

ixgbe info

dev.ix.0.%desc: Intel(R) PRO/10GbE PCI-Express Network Driver, Version - 2.5.15
dev.ix.0.%driver: ix
dev.ix.0.%location: slot=0 function=0 handle=\_SB_.PCI0.NPE9.X54I
dev.ix.0.%pnpinfo: vendor=0x8086 device=0x1528 subvendor=0x15d9 subdevice=0x1528 class=0x020000
dev.ix.0.%parent: pci5