Saturday, 26 November 2016

stress-ng 0.07.07 released

stress-ng is a tool that I have been developing on-and-off for a few years. It is designed to stress kernels to force out bugs, stress CPU and memory and also contains some performance benchmarking metrics too.

stress-ng is now entering the maturity part of the development phase, however, there is always scope to add new stressors and generally improve the tool.   I've just released version 0.07.07 for the Ubuntu Zesty 17.04 release and it contains a few additional features:
  • SIGUSR2 sent to stress-ng will dump out the current system load and memory statistics
  • Sched policy stress tests for different scheduler configurations
  • Add a missing --sockfd-port option
And various bug fixes:
  • Fixed up some minor memory leaks
  • Missing counter stats on bind-mount, fp-error, personality and resources stressors
  • Fix the --fiemap-bytes option
  • Fix up build warnings with various compilers and static analyzers
The major change to stress-ng over the past month was an internal re-working of system call and GNU features to abstract these into a shim layer to reduce the number build conditional #ifdef paths around code. This simplifies portability, so the code now builds more easily across a range of systems and with various versions of gcc and clang and fixes some issues on older kernels too.   This makes the code also faster to statically analyze with cppcheck.

For more details, visit the stress-ng project page or the quick help guide.

Sunday, 28 August 2016

Fixing an overheating Lenovo X230 laptop

Over the past month I've been hitting excessive thermal heating on my laptop and kidle_inject has been kicking in to try and stop the CPU from overheating (melting!).  A quick double-check with older kernels showed me that this issue was not thermal/performance regression caused by software - instead it was time to clean my laptop and renew the thermal paste.

After some quick research, I found that Artic MX-4 Thermal Compound provided an excellent thermal conductivity rating of 8.5W/mK so I ordered a 4g sample as well as a can of pressurized gas cleaner to clean out dust.

The X230 has an excellent hardware maintenance manual, and following the instructions I stripped the laptop right down so I could pop the heat pipe contacts off the CPU and GPU.  I carefully cleaned off the old dry and cracked thermal paste and applied about 0.2g of MX-4 thermal compound to the CPU and GPU and re-seated the heat pipe.  With the pressurized gas I cleaned out the fan and airways to maximize airflow over the heatpipe.   The entire procedure took about an hour to complete and for once I didn't have any screws left over after re-assembly!

I normally take photos of the position of components during the strip down of a laptop for reference in case I cannot figure out exactly how parts are meant to fix on the re-assembly phase.  In this case, the X230 maintenance manual is sufficiently detailed so I didn't take any photos this time.

I'm glad to report that my X230 is now no-longer overheating. Heat is being effectively pumped away from the CPU and GPU and one can feel the additional heat being pushed out of the laptop.  Once again I can fully max out the CPU and GPU without passive thermal cooling mechanisms being kicked into action, so I've now got 100% of my CPU performance back again; as good as new!

Now and again I see laptop overheating bugs being filed in LaunchPad.  While some are legitimate issues with broken software, I do wonder if the majority of issues with the older laptops is simply due to accumulation of dust and/or old and damaged thermal paste.

Saturday, 30 July 2016

Scanning the Linux kernel for error messages

The Linux kernel contains lots of error/warning/information messages; over 130,000 in the current 4.7 kernel.  One of the tests in the Firmware Test Suite (FWTS) is to find BIOS/ACPI/UEFI related kernel error messages in the kernel log and try to provide some helpful advice on each error message since some can be very cryptic to the untrained eye.

The FWTS kernel error log database is currently approaching 800 entries and I have been slowly working through another 800 or so more relevant and recently added messages.  Needless to say, this is taking a while to complete.  The hardest part was finding relevant error messages in the kernel as they appear in different forms (e.g. printk(), dev_err(), ACPI_ERROR() etc).

In order to scrape the Linux kernel source for relevant error messages I hacked up the kernelscan parser to find error messages and dump these to stdout.  kernelscan can scan 43,000 source files (17,900,000 lines of source) in under 10 seconds on my Lenovo X230 laptop, so it is relatively fast.

I also have been using kernelscan to find spelling mistakes in kernel messages and I've been punting trivial fixes upstream to fix these.  These mistakes are small and petty, but I find it a little irksome when I see the kernel emit a message that contains a typo or spelling mistake - it just looks a bit unprofessional.

I've created a kernelscan snap (which was really easy and fast to do using scancraft), so it is now available Ubuntu.  The source code is also available from the kernel team git web at http://kernel.ubuntu.com/git/cking/kernelscan.git/

The code is designed to only parse kernel source, and it is a very rough and ready parser designed for speed;  fundamentally, it is a big quick hack.  When I get a few spare minutes I will try and see if there is any correlation between the number of error messages with the size of the kernel over the various releases.

Wednesday, 29 June 2016

What's new in stress-ng 0.06.07?

Since my last blog post about stress-ng, I've pushed out several more small releases that incorporate new features and (as ever) a bunch more bug fixes.  I've been eyeballing gcov kernel coverage stats to find more regions in the kernel where stress-ng needs to exercise.   Also, testing on a range of hardware (arm64, s390x, etc) and a range of kernels has eeked out some bugs and helped me to improve stress-ng.  So what's new?

New stressors:
  • ioprio  - exercises ioprio_get(2) and ioprio_set(2) (I/O scheduling classes and priorities)
  • opcode - generates random object code and executes this, generating and catching illegal instructions, bus errors,  segmentation  faults,  traps and floating  point errors.
  • stackmmap - allocates a 2MB stack that is memory mapped onto a temporary file. A recursive function works down the stack and flushes dirty stack pages back to the memory mapped file using msync(2) until the end of the stack is reached (stack overflow). This exercises dirty page and stack exception handling.
  • madvise - applies random madvise(2) advise settings on pages of a 4MB file backed shared memory mapping.
  • pty - exercise pseudo terminal operations.
  • chown - trivial chown(2) file ownership exerciser.
  • seal - fcntl(2) file SEALing exerciser.
  • locka - POSIX advisory locking exerciser.
  • lockofd - fcntl(2) F_OFD_SETLK/GETLK open file description lock exerciser.
Improved stressors:
  • msg: add in IPC_INFO, MSG_INFO, MSG_STAT msgctl calls
  • vecmath: add more ops to make vecmath more demanding
  • socket: add --sock-type socket type option, e.g. stream or seqpacket
  • shm and shm-sysv: add msync'ing on the shm regions
  • memfd: add hole punching
  • mremap: add MAP_FIXED remappings
  • shm: sync, expand, shrink shm regions
  • dup: use dup2(2)
  • seek: add SEEK_CUR, SEEK_END seek options
  • utime: exercise UTIME_NOW and UTIME_OMIT settings
  • userfaultfd: add zero page handling
  • cache:  use cacheflush() on systems that provide this syscall
  • key:  add request_key system call
  • nice: add some randomness to the delay to unsync nicenesses changes
If any new features land in Linux 4.8 I may add stressors for them, but for now I suspect that's about it for the big changes for stress-ng for the Ubuntu Yakkey 16.10 release.