Disable hyper-threading on Intel Skylake & Kaby Lake CPUs

In 4½ months I have experienced 16 BSOD system crashes on a new work computer:

Crash Date  Bug Check String                     Bug Check Code  Caused By Address
21-06-2017  DRIVER_POWER_STATE_FAILURE           0x0000009f      ntoskrnl.exe+70e40
12-06-2017  NTFS_FILE_SYSTEM                     0x00000024      Ntfs.sys+4211
23-05-2017  IRQL_NOT_LESS_OR_EQUAL               0x0000000a      ntoskrnl.exe+6f4c0
10-05-2017  IRQL_NOT_LESS_OR_EQUAL               0x0000000a      ntoskrnl.exe+6f440
01-05-2017  BAD_POOL_HEADER                      0x00000019      win32k.sys+f13b2
24-03-2017  BAD_POOL_CALLER                      0x000000c2      ntoskrnl.exe+6f440
17-03-2017  SYSTEM_SERVICE_EXCEPTION             0x0000003b      afd.sys+41448
14-03-2017  MEMORY_MANAGEMENT                    0x0000001a      ntoskrnl.exe+70400
13-03-2017  PAGE_FAULT_IN_NONPAGED_AREA          0x00000050      VBoxDrv.sys+1f037
10-03-2017  PFN_LIST_CORRUPT                     0x0000004e      ntoskrnl.exe+70400
02-03-2017  SYSTEM_SERVICE_EXCEPTION             0x0000003b      ntoskrnl.exe+70400
22-02-2017  BAD_POOL_CALLER                      0x000000c2      TDI.SYS+10be
17-02-2017  BAD_POOL_HEADER                      0x00000019      ntoskrnl.exe+70400
16-02-2017  SYSTEM_THREAD_EXCEPTION_NOT_HANDLED  0x1000007e      iusb3xhc.sys+7dfb0
08-02-2017  PAGE_FAULT_IN_NONPAGED_AREA          0x00000050      ntoskrnl.exe+70400
07-02-2017  PFN_LIST_CORRUPT                     0x0000004e      ntoskrnl.exe+70400
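
A spread of many different bug checks across many modules tends to point away from a single faulty driver. The following is a minimal Python sketch that tallies the table above by bug check and by module. The helper names (module_of, tally_modules, tally_bug_checks) are made up for this illustration; the rows are copied from the table.

```python
from collections import Counter

# (bug check string, "caused by" address) pairs copied from the crash table.
CRASHES = [
    ("DRIVER_POWER_STATE_FAILURE", "ntoskrnl.exe+70e40"),
    ("NTFS_FILE_SYSTEM", "Ntfs.sys+4211"),
    ("IRQL_NOT_LESS_OR_EQUAL", "ntoskrnl.exe+6f4c0"),
    ("IRQL_NOT_LESS_OR_EQUAL", "ntoskrnl.exe+6f440"),
    ("BAD_POOL_HEADER", "win32k.sys+f13b2"),
    ("BAD_POOL_CALLER", "ntoskrnl.exe+6f440"),
    ("SYSTEM_SERVICE_EXCEPTION", "afd.sys+41448"),
    ("MEMORY_MANAGEMENT", "ntoskrnl.exe+70400"),
    ("PAGE_FAULT_IN_NONPAGED_AREA", "VBoxDrv.sys+1f037"),
    ("PFN_LIST_CORRUPT", "ntoskrnl.exe+70400"),
    ("SYSTEM_SERVICE_EXCEPTION", "ntoskrnl.exe+70400"),
    ("BAD_POOL_CALLER", "TDI.SYS+10be"),
    ("BAD_POOL_HEADER", "ntoskrnl.exe+70400"),
    ("SYSTEM_THREAD_EXCEPTION_NOT_HANDLED", "iusb3xhc.sys+7dfb0"),
    ("PAGE_FAULT_IN_NONPAGED_AREA", "ntoskrnl.exe+70400"),
    ("PFN_LIST_CORRUPT", "ntoskrnl.exe+70400"),
]


def module_of(address: str) -> str:
    """Return the lower-cased module name from a 'module+offset' string."""
    return address.split("+", 1)[0].lower()


def tally_modules(crashes):
    """Count crashes per module named in the 'caused by' column."""
    return Counter(module_of(address) for _, address in crashes)


def tally_bug_checks(crashes):
    """Count crashes per bug check string."""
    return Counter(name for name, _ in crashes)


if __name__ == "__main__":
    for module, count in tally_modules(CRASHES).most_common():
        print(f"{count:2d}  {module}")
    print(f"{len(tally_bug_checks(CRASHES))} distinct bug checks")
```

The kernel image (ntoskrnl.exe) is often named simply because that is where the corruption was detected, so a high count for it does not by itself identify the culprit.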

 

Until now I have:

  • Performed multiple memory tests.
  • Checked SSD health.
  • Checked system files.
  • Examined multiple memory dumps with WinDbg.
  • Installed all relevant firmware and driver updates.
  • Scanned for malware.

 

However, none of this has resolved the problems or revealed their real cause.

 

I eventually decided to replace the original memory modules:

2 x 8 GB DDR4-2133 CL15, Kingston KVR21N15D8K2/16

With:

2 x 8 GB DDR4-2133 CL15, Crucial CT8G4DFS8213.C8FDR1

 

This seemed to help somewhat.

System crashes used to be a semiweekly event.

After replacing the memory modules it became a semimonthly event.

 

The system has an Intel Skylake CPU (Core i7-6700)

 

It has recently been discovered that some Intel Skylake and Kaby Lake CPUs have a hardware bug related to hyper-threading.

The bug is described in: 6th Generation Intel® Processor Family – Specification Update

Quote:

“Under complex micro-architectural conditions, short loops of less than 64 instructions that use AH, BH, CH or DH registers as well as their corresponding wider register (e.g. RAX, EAX or AX for AH) may cause unpredictable system behavior. This can only happen when both logical processors on the same physical processor are active.”

 

Until system vendors include microcode fixes in firmware/UEFI updates, the only workaround is to disable hyper-threading.

 

The stability problems I have experienced could be caused by this CPU hardware bug.

So I have disabled hyper-threading in BIOS/UEFI setup and will await firmware updates. I hope that the system will finally be stable and reliable.
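
To confirm that the BIOS/UEFI setting took effect, the number of logical processors can be compared with the number of physical cores. Below is a minimal Python sketch, assuming a Windows installation where the wmic tool is still available (it is deprecated on newer Windows versions). The function names are illustrative only, and the exact line endings of wmic output may vary.

```python
import subprocess


def parse_wmic_values(text: str) -> dict:
    """Parse 'Key=Value' lines, summing numeric values across CPU sockets."""
    values = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep and value.isdigit():
            values[key] = values.get(key, 0) + int(value)
    return values


def hyperthreading_active(text: str) -> bool:
    """True when there are more logical processors than physical cores."""
    values = parse_wmic_values(text)
    return values["NumberOfLogicalProcessors"] > values["NumberOfCores"]


def query_wmic() -> str:
    """Windows only. Assumes wmic is installed and on the PATH."""
    result = subprocess.run(
        ["wmic", "cpu", "get", "NumberOfCores,NumberOfLogicalProcessors", "/value"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    print("Hyper-threading active:", hyperthreading_active(query_wmic()))
```

On a Core i7-6700 (4 cores) this should report 8 logical processors with hyper-threading enabled and 4 with it disabled.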

Conclusion

If you have an Intel Skylake or Kaby Lake CPU, it is recommended to disable hyper-threading for now.

USB driver problem preventing access to Samsung Android devices

I recently experienced problems connecting to Samsung Android devices with Android Studio from my work computer.

No Connected Devices were available.

 

I checked Device Manager and noticed a warning for: SAMSUNG Mobile USB Composite Device

 

Checked Properties and noticed the Device status:

Windows cannot load the device driver for this hardware. The driver may be corrupted or missing. (Code 39)

 

I checked driver details and noticed that the driver was unexpectedly: usbpcap.sys.

(The problem occurred after installing Wireshark and USBPcap…)

 

I decided to uninstall USBPcap. This didn’t solve the problem, but it changed the message for driver details to:

No driver files are required or have been loaded for this device.

 

Fixed the problem this way:

1. Clicked: Update Driver…

 

2. Clicked: Browse my computer for driver software

 

3. Clicked: Let me pick from a list of device drivers on my computer

 

4. Selected: SAMSUNG Mobile USB Composite Device Version: 2.12.4.0 [24-08-2016]

 

5. Clicked: Next

6. Noticed the message: Windows has successfully updated your driver software

 

7. Checked driver details, which now had the desired driver file:

C:\Windows\system32\DRIVERS\ssudbus.sys

 

This fixed the problem. It was again possible to connect to Samsung Android devices from Android Studio.

Examining PFN_LIST_CORRUPT (4e) and PAGE_FAULT_IN_NONPAGED_AREA (50) BSOD

I recently experienced stability problems on a new work computer, which crashed with a BSOD.

 

I looked for clues in Event Viewer and found:

Log Name:      System
Source:        Microsoft-Windows-WER-SystemErrorReporting
Event ID:      1001
Task Category: None
Level:         Error
Keywords:      Classic
Description:
The computer has rebooted from a bugcheck.  The bugcheck was: 0x0000004e (0x0000000000000099, 0x00000000003def55, 0x0000000000000000, 0x0000000000000001). A dump was saved in: C:\Windows\MEMORY.DMP.
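
When several of these events need to be compared, the bug check code and its four parameters can be pulled out of the description text. This is a minimal Python sketch, assuming the description keeps the wording shown above. The regular expression and the function name parse_bugcheck are illustrative and not part of any Windows tool.

```python
import re

# Matches "The bugcheck was: 0x... (0x..., 0x..., 0x..., 0x...)".
BUGCHECK_RE = re.compile(r"The bugcheck was: (0x[0-9a-fA-F]+) \(([^)]*)\)")

# Only the bug checks discussed in these posts.
BUG_CHECK_NAMES = {
    0x1A: "MEMORY_MANAGEMENT",
    0x4E: "PFN_LIST_CORRUPT",
    0x50: "PAGE_FAULT_IN_NONPAGED_AREA",
    0x124: "WHEA_UNCORRECTABLE_ERROR",
}

SAMPLE = (
    "The computer has rebooted from a bugcheck.  The bugcheck was: 0x0000004e "
    "(0x0000000000000099, 0x00000000003def55, 0x0000000000000000, "
    "0x0000000000000001). A dump was saved in: C:\\Windows\\MEMORY.DMP."
)


def parse_bugcheck(description: str):
    """Return (code, (p1, p2, p3, p4)) or None if no bug check is found."""
    match = BUGCHECK_RE.search(description)
    if match is None:
        return None
    code = int(match.group(1), 16)
    params = tuple(int(part, 16) for part in match.group(2).split(","))
    return code, params


if __name__ == "__main__":
    code, params = parse_bugcheck(SAMPLE)
    name = BUG_CHECK_NAMES.get(code, "unknown")
    print(f"{name} ({code:#x}):", ", ".join(f"{p:#x}" for p in params))
```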

 

Examined the memory dump with WinDbg (x64).

Checked for details about the crash with:

!analyze -v

Part of the result:

PFN_LIST_CORRUPT (4e)
Typically caused by drivers passing bad memory descriptor lists (ie: calling
MmUnlockPages twice with the same list, etc).  If a kernel debugger is
available get the stack trace.
Arguments:
Arg1: 0000000000000099, A PTE or PFN is corrupt
Arg2: 00000000003def55, page frame number
Arg3: 0000000000000000, current page state
Arg4: 0000000000000001, 0
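
The page frame number in Arg2 can be converted to a physical address by multiplying it by the page size. The sketch below assumes the standard 4 KiB page size on x64 Windows; the function name is made up for the example. Because of firmware memory remapping, the result is only a rough hint and does not map directly to a specific memory module.

```python
PAGE_SIZE = 4096  # assumption: standard 4 KiB pages on x64


def pfn_to_physical_address(pfn: int) -> int:
    """Return the physical address of the first byte of the page frame."""
    return pfn * PAGE_SIZE


if __name__ == "__main__":
    address = pfn_to_physical_address(0x3DEF55)  # Arg2 from the bug check
    print(f"physical address: {address:#x} ({address / 2**30:.2f} GiB)")
```

For this crash the page lies near the top of the physical address range of a system with 16 GB of memory.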

 

Examined the call stack with:

kp

Result:

Child-SP          RetAddr           Call Site
fffff880`030b34f8 fffff800`0311c37c nt!KeBugCheckEx
fffff880`030b3500 fffff800`03038c17 nt!MiBadShareCount+0x4c
fffff880`030b3540 fffff800`030bc057 nt! ?? ::FNODOBFM::`string'+0x2cf6d
fffff880`030b36f0 fffff800`030bda09 nt!MiDeleteVirtualAddresses+0x41f
fffff880`030b38b0 fffff800`033a9f21 nt!MiRemoveMappedView+0xd9
fffff880`030b39d0 fffff800`033aa323 nt!MiUnmapViewOfSection+0x1b1
fffff880`030b3a90 fffff800`03089693 nt!NtUnmapViewOfSection+0x5f
fffff880`030b3ae0 00000000`76febfda nt!KiSystemServiceCopyEnd+0x13
00000000`0a8df5d8 00000000`00000000 0x76febfda

 

Memory problems are typically caused by failing memory modules, so I tested the memory with Memtest86+.

I only had time to run it briefly, but it completed one pass without errors.

However the next day the computer crashed again with another BSOD…

 

I found this in Event Viewer:

Log Name:      System
Source:        Microsoft-Windows-WER-SystemErrorReporting
Event ID:      1001
Task Category: None
Level:         Error
Keywords:      Classic
Description:
The computer has rebooted from a bugcheck.  The bugcheck was: 0x00000050 (0xfffff8a0384b1280, 0x0000000000000000, 0xfffff800031fe133, 0x0000000000000000). A dump was saved in: C:\Windows\MEMORY.DMP.

 

Examined the new memory dump with:

!analyze -v

Part of the result:

PAGE_FAULT_IN_NONPAGED_AREA (50)
Invalid system memory was referenced.  This cannot be protected by try-except,
it must be protected by a Probe.  Typically the address is just plain bad or it
is pointing at freed memory.
Arguments:
Arg1: fffff8a0384b1280, memory referenced.
Arg2: 0000000000000000, value 0 = read operation, 1 = write operation.
Arg3: fffff800031fe133, If non-zero, the instruction address which referenced the bad memory
address.
Arg4: 0000000000000000, (reserved)

 

Examined the call stack with:

kp

Result:

Child-SP          RetAddr           Call Site
fffff880`031735f8 fffff800`031442be nt!KeBugCheckEx
fffff880`03173600 fffff800`030c552e nt! ?? ::FNODOBFM::`string'+0x3bc5f
fffff880`03173760 fffff800`031fe133 nt!KiPageFault+0x16e
fffff880`031738f0 fffff800`030af3b1 nt!ExFreePoolWithTag+0x43
fffff880`031739a0 fffff880`018450c6 nt!FsRtlUninitializeBaseMcb+0x41
fffff880`031739d0 fffff800`030d0355 Ntfs!NtfsMcbCleanupLruQueue+0xf6
fffff880`03173b70 fffff800`03362236 nt!ExpWorkerThread+0x111
fffff880`03173c00 fffff800`030b8706 nt!PspSystemThreadStartup+0x5a
fffff880`03173c40 00000000`00000000 nt!KxStartSystemThread+0x16

 

Another BSOD related to memory access strongly indicated problems with the memory modules.

Ran Memtest86+ overnight for 15+ hours.

The next day Memtest86+ had found 160 memory errors…

 

I decided to reseat the memory modules.

Then ran Memtest86+ overnight again for almost 16 hours.

The next day no memory errors were found.

Hopefully the cause of the BSOD crashes has been found and fixed. Time will tell.

Examining MEMORY_MANAGEMENT (1a) BSOD

A Lenovo Thinkpad T440p computer recently crashed with a BSOD.

I started looking for clues in Event Viewer and found:

Log Name:      System
Source:        Microsoft-Windows-WER-SystemErrorReporting
Event ID:      1001
Task Category: None
Level:         Error
Keywords:      Classic
Description:
The computer has rebooted from a bugcheck.  The bugcheck was: 0x0000001a (0x0000000000041792, 0xffff808000082f70, 0x0004000000000000, 0x0000000000000000). A dump was saved in: C:\WINDOWS\MEMORY.DMP.

 

Decided to examine the memory dump, so started WinDbg (x64) and opened:

C:\Windows\Memory.dmp

This message was displayed:

BugCheck 1A, {41792, ffff808000082f70, 4000000000000, 0}

Probably caused by : memory_corruption

 

Checked for more details with:

!analyze -v

Part of the result:

*************************************************************
*                                                           *
*                    Bugcheck Analysis                      *
*                                                           *
*************************************************************

MEMORY_MANAGEMENT (1a)
# Any other values for parameter 1 must be individually examined.
Arguments:
Arg1: 0000000000041792, A corrupt PTE has been detected. Parameter 2 contains the address of
the PTE. Parameters 3/4 contain the low/high parts of the PTE.
Arg2: ffff808000082f70
Arg3: 0004000000000000
Arg4: 0000000000000000

 

This issue indicated hardware failure, most likely defective memory.

So I booted Memtest86+ from a USB drive.

Within a few minutes it found multiple errors.

 

Tried cleaning the contacts on the memory modules, but it had no effect.

 

Then I tested each memory module separately in both sockets.

In every case the memory test found errors.

 

Decided to test another 8 GB memory module.

Ran Memtest86+ all night and it found no errors on the replacement memory module.

Conclusion

A computer that crashes with a MEMORY_MANAGEMENT (1a) BSOD likely has defective memory.

Test the memory with Memtest86+ or another test program.

Then replace any identified defective memory modules.

Cleaning computer cooling system to improve performance

While using a laptop computer I noticed high noise levels, caused by the cooling fan.

I decided to check the CPU temperatures using HWMonitor.

The idle temperatures were around 70° C.

And load temperatures were around 85°-94° C.

 

These temperatures could be high enough to cause thermal throttling, which would affect system performance.

Therefore I measured performance with 7-Zip and CPU-Z benchmarks.

 

I decided to disassemble the laptop computer and clean the cooling system, using compressed air.

(Be aware that most compressed air cans contain harmful chemicals, so use them outside or in a well-ventilated area.)

 

Cleaning the cooling system had a significant effect.

Now idle temperatures were much lower, around 55° C.

And load temperatures were also lower, around 74°-85° C.

Also noticed much lower fan speeds, so the computer wasn’t as noisy as before.

 

Table of thermal readings:

                 Idle     Load
Before cleaning  70° C    85°-94° C
After cleaning   55° C    74°-85° C
Improvement      15° C    9°-11° C

 

The system had been affected by thermal throttling, because the 7-Zip and CPU-Z benchmarks improved.

 

Table of benchmark results:

                 7-Zip score  CPU-Z single  CPU-Z multi
Before cleaning  12491        1321          4572
After cleaning   16572        1510          5494
Improvement      32,7%        14,3%         20,2%
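
The improvement row is the relative change from the "before" score to the "after" score. Here is a short Python sketch that recomputes it from the raw scores in the table; the function name is illustrative. The 7-Zip figure comes out at about 32.67%, which rounds to 32.7%.

```python
def improvement_percent(before: float, after: float) -> float:
    """Relative improvement of 'after' over 'before', in percent."""
    return (after - before) / before * 100


# Benchmark scores from the table: (before cleaning, after cleaning).
SCORES = {
    "7-Zip score": (12491, 16572),
    "CPU-Z single": (1321, 1510),
    "CPU-Z multi": (4572, 5494),
}

if __name__ == "__main__":
    for name, (before, after) in SCORES.items():
        print(f"{name}: {improvement_percent(before, after):.1f}%")
```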

 

Conclusion

It can be worthwhile to clean a computer’s cooling system.

It may improve noise levels, temperatures and performance.

Recommending heatsink for Raspberry Pi 3

When experimenting with a Raspberry Pi 3 I noticed that the CPU could get pretty warm.

I decided to measure the temperature during idle and load with:

cat /sys/class/thermal/thermal_zone0/temp

And load tested the CPU with 7-zip benchmark:

7zr b
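
The thermal zone file reports the temperature in thousandths of a degree Celsius, so a raw value of 51540 means 51.54 °C. For logging the temperature during a benchmark run, a minimal Python sketch could look like this. The function names and the 5 second interval are arbitrary choices for the example.

```python
import time
from pathlib import Path

THERMAL_FILE = Path("/sys/class/thermal/thermal_zone0/temp")


def millidegrees_to_celsius(raw: str) -> float:
    """Convert the raw sysfs value (millidegrees Celsius) to degrees Celsius."""
    return int(raw.strip()) / 1000.0


def read_cpu_temperature(path: Path = THERMAL_FILE) -> float:
    """Read the current CPU temperature from the thermal zone file."""
    return millidegrees_to_celsius(path.read_text())


if __name__ == "__main__":
    # Print a reading every 5 seconds until interrupted with Ctrl+C.
    while True:
        print(f"{read_cpu_temperature():.1f} °C")
        time.sleep(5)
```

Running this in one terminal while the benchmark runs in another gives a simple record of idle and load temperatures.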

 

Measured these temperatures:

                  Idle      Load
Without heatsink  50-52 °C  77-79 °C
With heatsink     47-49 °C  65-67 °C
Improvement       3 °C      12 °C

 

I used a small aluminium heatsink with thermal tape:

raspberry_pi_3_with_heatsink

 

Conclusion

The 12 degree reduction in load temperature is well worth the effort.

It should prevent thermal throttling and thereby ensure optimal performance.

Separate, physical trackpoint buttons on Lenovo Thinkpad T440p

The Lenovo Thinkpad T440p (and other models of that generation) is delivered with a touchpad, without separate physical left, middle and right buttons.

Instead the entire pad clicks and reacts depending on the area touched.

 

In my subjective opinion these buttons feel spongy and imprecise.

In use it’s common to make mistakes by clicking another button than expected.

This makes the laptop less productive and more frustrating to use.

 

However it’s possible to replace the touchpad with the one from the Lenovo Thinkpad T450, which has 3 separate, physical trackpoint buttons.

thinkpad_t440p_with_t450_trackpad

 

The first challenge is getting the right replacement part, with the dimensions 10 cm x 7,5 cm.

thinkpad_t440p_clickpad_horizontal

thinkpad_t440p_clickpad_vertical

It’s not available as a separate part from Lenovo, but is sold as part of the keyboard bezel.

The part number I found and used was: 00HN550

 

Be careful with online sellers claiming to sell touchpads that fit a long range of Thinkpad models.

They may fit electrically, but possibly not physically.

If you are considering performing this replacement, please verify that the part fits your particular Thinkpad model.

 

The next challenge is to disassemble the laptop and perform the replacement.

I refer to the hardware maintenance manual and online guides.

 

The final challenge is to solve driver problems on Windows.

The hardware ID for the touchpad is on the motherboard, which remains unchanged.

The default Synaptics Pointing Device drivers are not compatible and won’t work.

 

Simplest way to solve the driver problems on Windows:

1. Connect a USB mouse, because the trackpoint won’t work reliably until these steps have been completed.

2. Uninstall the Synaptics Pointing Device drivers using Programs and Features.

3. Restart the computer.

4. Remove any remaining Synaptics components by opening Control Panel -> Mouse

If asked: Do you want to uninstall the Synaptics driver now?

Then select yes and OK to the following dialogs:

synaptics_driver_uninstall1

synaptics_driver_uninstall2

synaptics_driver_uninstall3

5. Restart the computer.

6. Now the trackpoint and 3 physical buttons should work with a default mouse driver.

(Be aware that I have disabled the rest of the touchpad, so I don’t know if it works with the default mouse driver)

 

With Windows 10 extra steps are needed, because it can automatically install incompatible drivers.

This can be prevented by downloading and running the “Show or hide updates” program (wushowhide.diagcab) from:

https://support.microsoft.com/en-us/kb/3073930?utm_source=twitter

 

1. Click: Advanced

disable_driver_updates1

2. Deselect: Apply repairs automatically

disable_driver_updates2

3. Click: Next

4. Click: Hide updates

disable_driver_updates3

5. Select: Synaptics – Pointing Drawing – Synaptics Pointing Device

disable_driver_updates4

6. Click: Next

7. Confirm by clicking: Next

disable_driver_updates5

8. Click: Close the troubleshooter

disable_driver_updates6

 

Be aware that fully compatible drivers can be downloaded and installed from Lenovo, which will enable full touchpad functionality.

However I’m currently satisfied with a trackpoint and 3 physical buttons, so I have not found the correct drivers or procedure yet.

Laptop computer freezes when power supply is connected

I recently experienced a problem on a Lenovo Thinkpad T440p computer running Windows 10:

If the power supply was connected when the computer was running it would seemingly freeze: Mouse and keyboard became non-responsive.

However, it was not a full freeze or crash, because music from a media player would continue.

If the power supply was disconnected, then the computer became responsive again.

 

One way to avoid the problem was to connect the power when the computer was sleeping or turned off.

 

It was an annoying problem and some people would assume the computer had crashed.

So I searched for a solution and found this:

https://www.ifixit.com/Answers/View/70872/How+to+fix+a+notebook+that+freezes+when+plugged+to+AC+power

 

The solution that worked for me was to:

  1. Open Device Manager
  2. Find Batteries -> Microsoft AC Adapter
  3. Right click and disable the Microsoft AC Adapter

2016-09-12_laptop_freezes_when_connected_to_power

Reseat and clean contacts on memory modules

Computer memory can fail and it usually causes reliability problems, unless ECC memory is used.

My preferred memory testing tool is: Memtest86+

 

Normal procedure is to test memory modules individually in different memory sockets, to identify the failing memory module or socket.

This entails reseating the memory modules. (Before handling memory modules please take anti-static precautions)

Modern memory buses use high frequency low voltage signals. They need a good electrical connection to work reliably.

Sometimes the process of reseating the memory modules can solve the problem, if it was caused by an electrical connection issue.

 

If memory tests continue to fail and the memory module is out of warranty, you can try cleaning dust, dirt or corrosion off the contacts before replacing the module.

I suggest using a piece of cloth with rubbing alcohol.

After cleaning, test the memory module again. If memory tests can run without errors for 24 hours, then the problem is likely fixed.

Examine WHEA_UNCORRECTABLE_ERROR (124) BSOD with WinDbg

One of my computers recently crashed with a BSOD.

This occurs very rarely so I decided to identify the cause.

Troubleshooting

I checked the system event log for a bugcheck and found this:

Log Name:      System
Source:        Microsoft-Windows-WER-SystemErrorReporting
...
Description:
The computer has rebooted from a bugcheck.  The bugcheck was: 0x00000124 (0x0000000000000000, 0xfffffa800d91b038, 0x00000000b2004000, 0x0000000029000175). A dump was saved in: C:\Windows\MEMORY.DMP. Report Id: 040116-21216-01.

 

I decided to examine the C:\Windows\MEMORY.DMP crash dump with WinDbg. (In this case the x64 version of WinDbg)

WinDbg’s !analyze command usually reveals relevant information about a BSOD, so that’s what I checked first:

Run: !analyze -v

0: kd> !analyze -v

...
WHEA_UNCORRECTABLE_ERROR (124)
A fatal hardware error has occurred. Parameter 1 identifies the type of error
source that reported the error. Parameter 2 holds the address of the
WHEA_ERROR_RECORD structure that describes the error conditon.
Arguments:
Arg1: 0000000000000000, Machine Check Exception
Arg2: fffffa800d91b038, Address of the WHEA_ERROR_RECORD structure.
Arg3: 00000000b2004000, High order 32-bits of the MCi_STATUS value.
Arg4: 0000000029000175, Low order 32-bits of the MCi_STATUS value.

...

PRIMARY_PROBLEM_CLASS:  X64_0x124_AuthenticAMD_PROCESSOR_CACHE

 

It seemed to be a hardware error related to the processor cache.

For more details I looked at the WHEA_ERROR_RECORD information:

(Only section 2 with the actual error shown)

0: kd> !errrec fffffa800d91b038
===============================================================================
Common Platform Error Record @ fffffa800d91b038
-------------------------------------------------------------------------------
Record Id     : 01d17af4b6b560a4
Severity      : Fatal (1)
Length        : 928
Creator       : Microsoft
Notify Type   : Machine Check Exception
Timestamp     : 4/1/2016 18:19:44 (UTC)
Flags         : 0x00000000

...

===============================================================================
Section 2     : x86/x64 MCA
-------------------------------------------------------------------------------
Descriptor    @ fffffa800d91b148
Section       @ fffffa800d91b2d0
Offset        : 664
Length        : 264
Flags         : 0x00000000
Severity      : Fatal

Error         : DCACHEL1_EVICT_ERR (Proc 0 Bank 0)
Status      : 0xb200400029000175

 

Apparently a hardware error related to the level 1 data cache caused the system crash.
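
The Status value shown by !errrec is simply Arg3 and Arg4 of the bug check joined into one 64-bit MCi_STATUS value. The following minimal Python sketch combines the two halves and decodes the top flag bits. The bit positions follow the common x86 machine check layout (VAL = bit 63 down to PCC = bit 57); treat them as an assumption and check the CPU vendor's documentation. The function names are illustrative.

```python
# Flag bits in the upper part of MCi_STATUS (assumed common x86 layout).
FLAG_BITS = {
    "VAL": 63,    # register contains a valid error
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # error was not corrected by hardware
    "EN": 60,     # error reporting was enabled
    "MISCV": 59,  # MCi_MISC register is valid
    "ADDRV": 58,  # MCi_ADDR register is valid
    "PCC": 57,    # processor context may be corrupt
}


def combine_status(high: int, low: int) -> int:
    """Join the high and low 32-bit halves into the 64-bit status value."""
    return (high << 32) | low


def decode_flags(status: int) -> dict:
    """Return each flag name mapped to True if its bit is set."""
    return {name: bool((status >> bit) & 1) for name, bit in FLAG_BITS.items()}


if __name__ == "__main__":
    status = combine_status(0xB2004000, 0x29000175)  # Arg3 and Arg4
    print(f"status: {status:#018x}")
    print("flags set:", [name for name, is_set in decode_flags(status).items() if is_set])
```

For this crash the uncorrected (UC) and processor context corrupt (PCC) flags are set, which is consistent with the error being reported as fatal.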

 

The computer in question has an AMD Athlon II X2 280 CPU.

Using CPU-Z I noticed that the core voltage seemed a little low for this CPU.

I remembered that I had undervolted the CPU to save power.

(Did not have reliability problems with it for years until now)

 

I checked the BIOS settings and discovered that the CPU was undervolted by -0,15 volts.

I decided to change it to -0,1 volts.

If any other reliability problems occur, I will change it back to standard voltage.

Conclusion

If hardware is running outside its specifications and system crashes occur, then adjust the settings closer to the specifications.

(Examples of running outside specifications: undervolting, overvolting and overclocking)