NAS4Free/FreeBSD 9.2 Stability Issues
Notes about how to get NAS4Free stable
NAS4Free is crashing unexpectedly after a period of approximately 1 day to 2 weeks. The crashes are far enough apart to make it difficult to determine the root cause.
Solution: Use FreeNAS.
Hardware
- Supermicro X9DRH-7TF
- a single E5-2609 CPU
- X540 NICs (ixgbe)
- LSI 2208 SAS controller
- 32GB ECC RAM
- 8 WD Red drives (ZFS data)
- 2 Intel S3700 (ZFS ZIL)
- 1 Intel S3500 (ZFS L2ARC/Cache)
- IBM SAS M1015 controller in IT mode
- APC UPS
Attempts to remedy stability
Try anything that will either
- show any cause for the instability
- show the root cause for the instability
Action | Description |
---|---|
Upgrade NAS4Free | In the hope that the new kernel/software addresses the issue. Kernel is now: FreeBSD 9.2-RELEASE-p3 #0 r260900M |
Swap power supply | I've tried two reasonable power supplies. Both are 650W and have 12-volt power for the CPU. |
RAM Check | The RAM is ECC Registered. There are no log entries to indicate a RAM issue. The RAM has been checked with a memory tester. |
Add serial console to capture messages | The serial console had what looked like a flow control issue where the output was overwriting itself. No useful output captured. |
Remote logging | All logging was sent to syslog. No messages captured. |
Disconnect UPS | The network-based UPS support was disabled to ensure it wasn't causing a reboot. |
Add ZFS swap | Add a ZFS-based swap. If the kernel was having a short-term memory issue, this might help it through. |
Add vanilla swap | If the ZFS filesystem was causing the memory to get eaten, having swap on the ZFS volume might compound the issue. Add a couple of SATA disks and put them in a mirror with a 64G slice for swap. I tried removing the swap; within 24 hours the machine had problems and killed off the iSCSI target: kernel: pid 4113 (istgt), uid 0, was killed: out of swap space |
Optimise memory | Run the ZFS Kernel Tune webGUI extension. The system was tending to use only half of the 32G of RAM. |
Upgrade IPMI firmware | |
Upgrade main BIOS | |
Remove all disks from the onboard 2208 | The ZIL and L2ARC cache disks were connected to the LSI 2208 controller. If that driver and/or hardware is the issue then removing the disks might help. Move the SSD disks and the swap mirror to the onboard C602 SATA ports. Removing the disks from the LSI 2208 SAS controller improved stability. |
memtest86+ and memtest86 | Run memtest86+ and memtest86 again just to confirm there isn't a new (or existing) memory issue. |
Disable the 2208 | Change the onboard jumper to disable the 2208 chip. |
Network
Network connectivity is provided by two 10G ports running at 1G speed in a lagg with VLANs.
Residuals
Known issues that probably don't impact stability (but are likely to impact performance).
Item | Description |
---|---|
Raidz2 with non-optimal number of disks | Reading indicates that the number of disks should fit into a 128k stripe, i.e. for a raidz2 virtual device with 'n' disks, 128 / (n - 2) / sector-size should be a whole number. Otherwise performance will suffer. Eight 4k-sector disks give 128 / (8 - 2) / 4 = 5.33, whereas six 4k-sector disks give 128 / (6 - 2) / 4 = 8. |
Listen backlog on apps seems low | While looking for possible issues, the listen backlog for several applications seems low. The iSCSI target (istgt) has a very low listen backlog value. This causes the iSCSI client to generate a large number of messages when it tries to connect all targets back up at once (as most of them get rejected due to exceeding the backlog threshold). |
PCIe bus issues | The machine/CPU is getting PCIe bus errors. |
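
The raidz2 rule of thumb in the table above can be checked for other disk counts with a short Python sketch. The function names and default values here are illustrative assumptions (128k record, 4k sectors, two parity disks); it only reproduces the arithmetic quoted in the table and says nothing about real-world performance.

```python
def raidz_sectors_per_disk(total_disks, parity_disks=2, record_kib=128, sector_kib=4):
    """Sectors of one record that land on each data disk: record / (n - parity) / sector."""
    data_disks = total_disks - parity_disks
    if data_disks <= 0:
        raise ValueError("need more disks than parity disks")
    return record_kib / data_disks / sector_kib


def is_even_split(total_disks, parity_disks=2, record_kib=128, sector_kib=4):
    """True when the record divides into a whole number of sectors per data disk."""
    data_disks = total_disks - parity_disks
    if data_disks <= 0:
        raise ValueError("need more disks than parity disks")
    # Integer arithmetic avoids any floating-point rounding in the check.
    return record_kib % (data_disks * sector_kib) == 0


if __name__ == "__main__":
    for n in (6, 8, 10):
        per_disk = raidz_sectors_per_disk(n)
        print(f"{n} disks: {per_disk:.2f} sectors per data disk, even={is_even_split(n)}")
```

With these defaults, six disks come out even (8 sectors per data disk) and eight disks do not (5.33), matching the figures in the table.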
Links
- http://forums.nas4free.org/viewtopic.php?f=71&t=1278
- man pages
- 9.2 ixgbe tx queue hang (was: Network loss)
- http://adrianchadd.blogspot.co.nz/2013/08/hacking-on-intel-10ge-driver-ixgbe-for.html
- Intel
- https://doc.pfsense.org/index.php/Tuning_and_Troubleshooting_Network_Cards
Appendices
Error message
    kernel: sonewconn: pcb 0xfffffe01d7149ab8: Listen queue overflow: 2 already in queue awaiting acceptance
    last message repeated 111 times
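
To get a feel for how often the listen queue overflows, messages like the one above can be tallied from a syslog capture. The following Python sketch is one possible approach and not part of NAS4Free; the function name is hypothetical, and it assumes syslog writes the "last message repeated" notice on its own line directly after the message it refers to.

```python
import re

# Pattern for the kernel's listen queue overflow message shown above.
LISTEN_OVERFLOW = re.compile(
    r"sonewconn: pcb (?P<pcb>0x[0-9a-fA-F]+): Listen queue overflow: "
    r"(?P<queued>\d+) already in queue awaiting acceptance"
)
# Pattern for syslog's repeat-compression notice.
REPEATED = re.compile(r"last message repeated (?P<count>\d+) times?")


def count_overflows(lines):
    """Tally overflow events per pcb address, adding syslog repeat counts."""
    totals = {}
    last_pcb = None  # pcb of the most recent overflow line, if still current
    for line in lines:
        match = LISTEN_OVERFLOW.search(line)
        if match:
            last_pcb = match.group("pcb")
            totals[last_pcb] = totals.get(last_pcb, 0) + 1
            continue
        repeat = REPEATED.search(line)
        if repeat and last_pcb is not None:
            totals[last_pcb] += int(repeat.group("count"))
        else:
            # Any unrelated line ends the association with the last overflow.
            last_pcb = None
    return totals


if __name__ == "__main__":
    sample = [
        "kernel: sonewconn: pcb 0xfffffe01d7149ab8: Listen queue overflow: "
        "2 already in queue awaiting acceptance",
        "last message repeated 111 times",
    ]
    print(count_overflows(sample))
```

Grouping by pcb address makes it easier to see whether a single listening socket (for example the iSCSI target) accounts for most of the overflows.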
PCI Info
    mfi0@pci0:1:0:0: class=0x010400 card=0x069015d9 chip=0x005b1000 rev=0x05 hdr=0x00
        vendor = 'LSI Logic / Symbios Logic'
        device = 'MegaRAID SAS 2208 [Thunderbolt]'
        class = mass storage
        subclass = RAID
    mps0@pci0:3:0:0: class=0x010700 card=0x30201000 chip=0x00721000 rev=0x03 hdr=0x00
        vendor = 'LSI Logic / Symbios Logic'
        device = 'SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon]'
        class = mass storage
        subclass = SAS
    ix0@pci0:5:0:0: class=0x020000 card=0x152815d9 chip=0x15288086 rev=0x01 hdr=0x00
        vendor = 'Intel Corporation'
        device = 'Ethernet Controller 10 Gigabit X540-AT2'
        class = network
        subclass = ethernet
    ix1@pci0:5:0:1: class=0x020000 card=0x152815d9 chip=0x15288086 rev=0x01 hdr=0x00
        vendor = 'Intel Corporation'
        device = 'Ethernet Controller 10 Gigabit X540-AT2'
        class = network
        subclass = ethernet
ixgbe info
    dev.ix.0.%desc: Intel(R) PRO/10GbE PCI-Express Network Driver, Version - 2.5.15
    dev.ix.0.%driver: ix
    dev.ix.0.%location: slot=0 function=0 handle=\_SB_.PCI0.NPE9.X54I
    dev.ix.0.%pnpinfo: vendor=0x8086 device=0x1528 subvendor=0x15d9 subdevice=0x1528 class=0x020000
    dev.ix.0.%parent: pci5