やるきなし

2016/07/15 12:01 / CacheFiles BUG (Linux 4.6.x)

Since 4.5.7 is the last release of the 4.5 series, I tried to update my server to 4.6.4. However, the following oops prevents me from upgrading:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000098
IP: [<ffffffff8127343f>] cachefiles_mark_object_inactive+0x4f/0xa0
PGD 0
Oops: 0000 [#1] SMP
Modules linked in: dm_crypt dm_mod algif_skcipher af_alg rpcsec_gss_krb5 overlay squashfs ext4 crc16 jbd2 mbcache loop cpufreq_powersave cpufreq_userspace cpufreq_conservative cpufreq_ondemand cpufreq_stats binfmt_misc nfsd btrfs xor raid6_pq sg sr_mod cdrom sd_mod intel_rapl x86_pkg_temp_thermal intel_powerclamp kvm_intel kvm irqbypass crct10dif_pclmul crct10dif_common crc32_pclmul ghash_clmulni_intel sha256_ssse3 sha256_generic snd_hda_codec_hdmi drbg ansi_cprng hid_generic uas usbhid snd_hda_codec_realtek usb_storage snd_hda_codec_generic hid snd_hda_intel aesni_intel snd_hda_codec aes_x86_64 iTCO_wdt snd_hda_core ablk_helper iTCO_vendor_support snd_pcm ahci cryptd snd_timer libahci lrw nvidiafb evdev snd gf128mul libata vgastate soundcore glue_helper lpc_ich mei_me scsi_mod pcspkr acpi_cpufreq serio_raw i2c_i801 mfd_core mei shpchp tpm_tis rtc_cmos tpm button processor md_mod fbcon bitblit fbcon_rotate fbcon_ccw fbcon_ud fbcon_cw softcursor tileblit lm78 hwmon_vid f71882fg i5k_amb coretemp msr nvidia(PO) drm agpgart fuse autofs4 crc32c_intel xhci_pci ehci_pci xhci_hcd ehci_hcd usbcore usb_common
CPU: 2 PID: 926 Comm: kworker/u16:4 Tainted: P           O    4.6.4-myn-01+ #95
Hardware name: OEM something
Workqueue: fscache_object fscache_object_work_func
task: ffff88042c02f2c0 ti: ffff880425fa4000 task.ti: ffff880425fa4000
RIP: 0010:[<ffffffff8127343f>]  [<ffffffff8127343f>] cachefiles_mark_object_inactive+0x4f/0xa0
RSP: 0018:ffff880425fa7d90  EFLAGS: 00010286
RAX: 0000000000000000 RBX: ffff88042ac14400 RCX: 0000000000000034
RDX: ffff88043f7ceb10 RSI: ffff8803c219f390 RDI: ffff88043f7ceb08
RBP: ffff8803c219f280 R08: 0000000000000000 R09: 00000000000000ec
R10: 0000000000000000 R11: 0000000000000000 R12: ffff88042ca2f000
R13: ffff88042ac14400 R14: 00000000ffffffff R15: 00000000ffffb96f
FS:  0000000000000000(0000) GS:ffff88043f480000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000098 CR3: 0000000001c06000 CR4: 00000000000406e0
Stack:
 ffff8803c219f280 ffff88042ac14400 ffffffff81271c75 ffff8803c219f280
 ffff880418c4c140 ffff8803bf985a50 ffffffff8120a74a ffff8803c219f310
 ffffffff81827580 ffff8803c219f280 000000000000006d ffffffff8120aae9
Call Trace:
 [<ffffffff81271c75>] ? cachefiles_drop_object+0xd5/0x180
 [<ffffffff8120a74a>] ? fscache_drop_object+0xda/0x260
 [<ffffffff8120aae9>] ? fscache_object_work_func+0xf9/0x460
 [<ffffffff8106e925>] ? process_one_work+0x135/0x3c0
 [<ffffffff8106ec0d>] ? worker_thread+0x5d/0x470
 [<ffffffff8106ebb0>] ? process_one_work+0x3c0/0x3c0
 [<ffffffff81073e2a>] ? kthread+0xca/0xe0
 [<ffffffff81603402>] ? ret_from_fork+0x22/0x40
 [<ffffffff81073d60>] ? kthread_create_on_node+0x170/0x170
Code: 11 09 00 48 8d bd 10 01 00 00 f0 80 a5 10 01 00 00 fe c6 83 20 01 00 00 00 31 f6 e8 7c b7 e1 ff 48 8b 85 f8 00 00 00 48 8b 40 30 <48> 8b 80 98 00 00 00 f0 48 01 83 30 01 00 00 b8 01 00 00 00 f0
RIP  [<ffffffff8127343f>] cachefiles_mark_object_inactive+0x4f/0xa0
 RSP <ffff880425fa7d90>
CR2: 0000000000000098
---[ end trace 0b204b8eafde5e55 ]---
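
To pull the key facts out of a trace like the one above (fault address, faulting function, offset within it), a small parser is enough. This is only a rough sketch: the function name `summarize_oops` and the dictionary keys are made up for this example, and the regular expressions assume the 4.6-era x86_64 oops layout shown in this post.

```python
import re

def summarize_oops(text):
    """Extract a few headline fields from an x86_64 kernel oops (4.6-era layout)."""
    summary = {}

    # "BUG: unable to handle kernel NULL pointer dereference at <address>"
    m = re.search(r"NULL pointer dereference at ([0-9a-f]+)", text)
    if m:
        summary["fault_address"] = int(m.group(1), 16)

    # "IP: [<address>] symbol+0xOFFSET/0xSIZE [module]" (module is optional)
    m = re.search(
        r"^IP: \[<([0-9a-f]+)>\]\s+([\w.]+)\+0x([0-9a-f]+)/0x([0-9a-f]+)"
        r"(?:\s+\[(\w+)\])?",
        text,
        re.M,
    )
    if m:
        summary["ip"] = int(m.group(1), 16)
        summary["symbol"] = m.group(2)
        summary["offset"] = int(m.group(3), 16)
        summary["size"] = int(m.group(4), 16)
        summary["module"] = m.group(5)  # None when the code is built in

    # "Workqueue: <queue name> <work function>"
    m = re.search(r"^Workqueue: (\S+) (\S+)", text, re.M)
    if m:
        summary["workqueue"] = m.group(1)
        summary["work_func"] = m.group(2)

    return summary
```

Fed the trace above, this reports a fault address of 0x98 inside cachefiles_mark_object_inactive at offset 0x4f, running on the fscache_object workqueue. A fault address this close to zero is typical of a field being read through a NULL structure pointer.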

The function cachefiles_mark_object_inactive was introduced by the commit "CacheFiles: Provide read-and-reset release counters for cachefilesd". With this commit reverted, my Linux 4.6.4 system works fine.

P.S.(2016/7/16)

My system is Debian GNU/Linux stable (jessie). The version of cachefilesd is 0.10.5-1. I may need to update it to the latest version, 0.10.9, which includes the change "Suspend/resume culling based on recently released file/block counts".

P.S.(2016/8/23)

Although it seemed that this patch (4.7.2 version) might be related to this issue, I still got the following oops on 4.7.2.

BUG: unable to handle kernel NULL pointer dereference at 0000000000000098
IP: [<ffffffffa0bc386c>] cachefiles_mark_object_inactive+0x1c/0xa0 [cachefiles]
PGD 0
Oops: 0000 [#1] SMP
Modules linked in: rpcsec_gss_krb5 overlay squashfs ext4 crc16 jbd2 mbcache dm_crypt loop algif_skcipher af_alg dm_mod cpufreq_powersave cpufreq_userspace cpufreq_conservative cpufreq_ondemand cpufreq_stats binfmt_misc nfsd btrfs xor raid6_pq sg sr_mod cdrom sd_mod intel_rapl x86_pkg_temp_thermal intel_powerclamp kvm_intel kvm irqbypass crct10dif_pclmul crct10dif_common crc32_pclmul ghash_clmulni_intel drbg snd_hda_codec_hdmi ansi_cprng hid_generic usbhid uas snd_hda_codec_realtek snd_hda_codec_generic hid usb_storage snd_hda_intel aesni_intel snd_hda_codec aes_x86_64 snd_hda_core ablk_helper cryptd snd_pcm iTCO_wdt lrw ahci iTCO_vendor_support snd_timer gf128mul nvidiafb libahci evdev snd glue_helper libata vgastate soundcore pcspkr serio_raw mei_me i2c_i801 scsi_mod acpi_cpufreq lpc_ich mei shpchp mfd_core rtc_cmos tpm_tis tpm button processor cachefiles fbcon bitblit fbcon_rotate fbcon_ccw fbcon_ud fbcon_cw softcursor tileblit md_mod lm78 hwmon_vid f71882fg i5k_amb coretemp msr nvidia(PO) drm agpgart fuse autofs4 crc32c_intel xhci_pci ehci_pci xhci_hcd ehci_hcd usbcore usb_common
CPU: 2 PID: 938 Comm: kworker/u16:6 Tainted: P           O    4.7.2-myn-01+ #101
Hardware name: OEM something
Workqueue: fscache_object fscache_object_work_func
task: ffff88042c6a0f00 ti: ffff880426668000 task.ti: ffff880426668000
RIP: 0010:[<ffffffffa0bc386c>]  [<ffffffffa0bc386c>] cachefiles_mark_object_inactive+0x1c/0xa0 [cachefiles]
RSP: 0018:ffff88042666bd88  EFLAGS: 00010202
RAX: 0000000000000000 RBX: ffff88042a9b2200 RCX: 0000000200000000
RDX: ffff88042c6a0f00 RSI: ffff8803e8d122c0 RDI: ffff88042a9b2320
RBP: ffff8803e8d122c0 R08: 0000000000000000 R09: 00000000000000ec
R10: 0000000000000000 R11: 0000000000000000 R12: ffff88042ca08f00
R13: ffff88042a9b2200 R14: 00000000ffffffff R15: 00000000ffff0e67
FS:  0000000000000000(0000) GS:ffff88043f480000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000098 CR3: 0000000001c06000 CR4: 00000000000406e0
Stack:
 ffff8803e8d122c0 ffff88042a9b2200 ffff88042ca08f00 ffffffffa0bc20d5
 ffff8803e8d122c0 ffff88041a336780 ffff8800ae5fb230 ffffffff81213e7a
 ffff8803e8d12350 ffffffff8182d900 ffff8803e8d122c0 000000000000006d
Call Trace:
 [<ffffffffa0bc20d5>] ? cachefiles_drop_object+0xd5/0x180 [cachefiles]
 [<ffffffff81213e7a>] ? fscache_drop_object+0xda/0x260
 [<ffffffff81214219>] ? fscache_object_work_func+0xf9/0x460
 [<ffffffff810738f5>] ? process_one_work+0x135/0x3c0
 [<ffffffff81073bdd>] ? worker_thread+0x5d/0x470
 [<ffffffff81073b80>] ? process_one_work+0x3c0/0x3c0
 [<ffffffff81078dda>] ? kthread+0xca/0xe0
 [<ffffffff8161063f>] ? ret_from_fork+0x1f/0x40
 [<ffffffff81078d10>] ? kthread_create_on_node+0x170/0x170
Code: 8d fe ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 41 54 55 48 89 f5 53 48 8b 86 f8 00 00 00 48 89 fb 48 8d bf 20 01 00 00 48 8b 40 30 <4c> 8b a0 98 00 00 00 e8 88 ca a4 e0 48 8d bd 28 01 00 00 48 8d
RIP  [<ffffffffa0bc386c>] cachefiles_mark_object_inactive+0x1c/0xa0 [cachefiles]
 RSP <ffff88042666bd88>
CR2: 0000000000000098
---[ end trace b0980ebb0851d8ca ]---
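
Both traces stop on the same kind of instruction. As a rough sketch (not a general disassembler), the following Python decodes only the one instruction form seen here, a 64-bit `mov` load with a 32-bit displacement and no SIB byte, starting at the byte marked `<..>` in the `Code:` line. The helper name `decode_faulting_load` is made up for this example.

```python
import re

# x86_64 general-purpose registers, indexed by their 4-bit encoding.
REGS = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi",
        "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15"]

def decode_faulting_load(code_line):
    """Decode 'REX.W 8b /r disp32' at the <..> marker of an oops Code: line.

    Returns (dest_reg, base_reg, displacement), or None when the faulting
    instruction is not this specific form.
    """
    tokens = code_line.split()
    start = next(
        (i for i, t in enumerate(tokens) if re.fullmatch(r"<[0-9a-f]{2}>", t)),
        None,
    )
    if start is None:
        return None

    # The marked byte plus the six bytes after it: REX, opcode, ModRM, disp32.
    raw = [tokens[start].strip("<>")] + tokens[start + 1:start + 7]
    if len(raw) < 7:
        return None
    b = bytes(int(t, 16) for t in raw)

    rex, opcode, modrm = b[0], b[1], b[2]
    if (rex & 0xF8) != 0x48 or opcode != 0x8B:
        return None  # need a REX prefix with W=1 and 'mov r64, r/m64'

    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if mod != 2 or rm == 4:
        return None  # only [base + disp32] without a SIB byte is handled

    dest = REGS[reg | ((rex & 0x04) << 1)]  # REX.R extends the reg field
    base = REGS[rm | ((rex & 0x01) << 3)]   # REX.B extends the r/m field
    disp = int.from_bytes(b[3:7], "little", signed=True)
    return dest, base, disp
```

For the 4.6.4 trace this gives a load from 0x98(%rax) into %rax, and for the 4.7.2 trace a load from 0x98(%rax) into %r12. Both register dumps show RAX: 0000000000000000, so the load reads address 0x98, which matches CR2 and is consistent with a structure field being read through a NULL pointer.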

P.S.(2016/11/2)

This bug was finally fixed by https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/fs/cachefiles?id=a818101d7b92e76db2f9a597e4830734767473b9. The patch is included in Linux 4.8.4 as 336f2e1ef8d52fac6420aff8d50191fc81c0c4ec.