TL;DR: The test ELK containers mount shared NFS storage for snapshot backups; a network policy change made the NFS server unreachable.
df hung, so the NFS state could not be checked. Running umount -lf directly on the node, and then umount -lf inside the pod's container (entered via docker), brought df back to normal.
As a temporary fix, the YAML was edited to comment out the NFS mount and PVC definition, and the workload was redeployed.
1. Symptoms
The ELK snapshot backup failed, and df hung after logging into the pod.
Kubernetes reported the following events:
Unable to mount volumes for pod "elklogsvc-data-1-7f846f45c8-nnqd5_elk-logsvc-uat(53acafec-4660-11ea-b3ac-98039b885796)": timeout expired waiting for volumes to attach or mount for pod "elk-logsvc-uat"/"elklogsvc-data-1-7f846f45c8-nnqd5". list of unmounted volumes=[es-snapshort]. list of unattached volumes=[es-uat-data eslog escert jdk-secpolicy es-snapshort default-token-vlghg]
After redeploying, the old pod could not be removed and the new pod stayed in Pending.
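A minimal way to surface these events without exec'ing into anything (a sketch; the namespace and pod name are taken from the event above):
kubectl -n elk-logsvc-uat describe pod elklogsvc-data-1-7f846f45c8-nnqd5
kubectl -n elk-logsvc-uat get events --sort-by=.lastTimestamp | grep -i mount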
2. Analysis
The NFS server did not answer ping, so a network policy change was the likely cause.
On the host, the NFS mounts were all still present and umount -f could not detach them. The NFS server side itself was fine.
mount | grep nfs
nfs_ip:/neworiental/nfs on /var/lib/kubelet/pods/3cc68af0-cb04-11e9-ab84-98039b88726a/volumes/kubernetes.io~nfs/pv-elklogsvcuat-snapshot-nfs type nfs4 (rw,relatime,vers=4.1,rsize=524288,wsize=524288,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=172.22.29.203,local_lock=none,addr=172.24.202.25)
nfs_ip:/neworiental/nfs on /var/lib/kubelet/pods/3779b2d1-cb05-11e9-ab84-98039b88726a/volumes/kubernetes.io~nfs/pv-elklogsvcuat-snapshot-nfs type nfs4 (rw,relatime,vers=4.1,rsize=524288,wsize=524288,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=172.22.29.203,local_lock=none,addr=172.24.202.25)
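Since df and stat can hang indefinitely against a dead NFS server, a few non-blocking checks help confirm it is the client mount that is stuck (a sketch; nfs_ip stands in for the real server address as in the mount output above):
# read the mount table from /proc instead of stat'ing the mount points
grep nfs4 /proc/mounts
# wrap potentially hanging calls in a timeout
timeout 5 df /var/lib/kubelet/pods/3cc68af0-cb04-11e9-ab84-98039b88726a/volumes/kubernetes.io~nfs/pv-elklogsvcuat-snapshot-nfs || echo "mount is hung"
# check whether the server still answers NFS RPCs at all
timeout 5 showmount -e nfs_ip || echo "NFS server unreachable"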
3. Resolution
3.1 Force-unmount the NFS filesystem on the host
umount -lf /var/lib/kubelet/pods/3cc68af0-cb04-11e9-ab84-98039b88726a/volumes/kubernetes.io~nfs/pv-elklogsvcuat-snapshot-nfs
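If more than one pod on the node still holds a stale mount of this PV, the same lazy force-unmount can be looped over every matching entry (a sketch; the grep pattern assumes only this snapshot PV should be detached):
grep 'kubernetes.io~nfs/pv-elklogsvcuat-snapshot-nfs' /proc/mounts | awk '{print $2}' | \
  while read -r mp; do umount -lf "$mp"; done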
After unmounting on the host, the old pod still could not be removed. Since the container runs in a separate mount namespace, the NFS mount also had to be detached inside the container.
3.2 Enter the container as root via docker and umount
docker ps | grep elk | grep coo
af8c30434775 dir.staff.xdf.cn/xdf-pub/elasticsearch "/usr/local/bin/dock…" 8 months ago Up 8 months k8s_elklogsvc-coo-3_elklogsvc-coo-3-544d46d664-r8r67_elk-logsvc-uat_3779b2d1-cb05-11e9-ab84-98039b88726a_0
bc35792cd4a0 dir.staff.xdf.cn/google_containers/pause:3.1 "/pause" 8 months ago Up 8 months k8s_POD_elklogsvc-coo-3-544d46d664-r8r67_elk-logsvc-uat_3779b2d1-cb05-11e9-ab84-98039b88726a_0
# the application container id
af8c30434775
# umount
docker exec -u root -it --privileged af8c30434775 /bin/bash
umount -lf /usr/share/elasticsearch/snapshort
# df inside the container is back to normal
df
Filesystem 1K-blocks Used Available Use% Mounted on
overlay 103081248 23826408 74848944 25% /
tmpfs 65536 0 65536 0% /dev
tmpfs 131887444 0 131887444 0% /sys/fs/cgroup
/dev/mapper/vg_root-lv_root 480486104 11335796 444719924 3% /etc/hosts
/dev/mapper/vg_root-lv_docker 103081248 23826408 74848944 25% /etc/hostname
shm 65536 0 65536 0% /dev/shm
/dev/mapper/vg_root-lv_new 3097960600 1556053352 1405997056 53% /usr/share/elasticsearch/data
tmpfs 131887444 12 131887432 1% /run/secrets/kubernetes.io/serviceaccount
tmpfs 131887444 4 131887440 1% /usr/share/elasticsearch/certs/logsvcuat-elastic.p12
tmpfs 131887444 0 131887444 0% /proc/acpi
tmpfs 131887444 0 131887444 0% /proc/scsi
tmpfs 131887444 0 131887444 0% /sys/firmware
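An alternative that avoids needing a usable shell in the image is to enter the container's mount namespace from the host with nsenter (a sketch; the container id is the one found above, and the in-container path is the same /usr/share/elasticsearch/snapshort):
# resolve the container's init PID on the host
PID=$(docker inspect -f '{{.State.Pid}}' af8c30434775)
# run umount inside that container's mount namespace
nsenter -t "$PID" -m umount -lf /usr/share/elasticsearch/snapshort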
3.3 Redeploy the pod
After commenting out the NFS mount and PVC in the YAML and redeploying, the kubelet logs still reported NFS mount errors for a while; eventually the kubelet killed the old pod, the new pod started, and everything recovered.
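The temporary YAML change mentioned in the TL;DR amounts to commenting out the snapshot volume and its mount in the Deployment; a minimal sketch (the claimName and file name below are hypothetical, the real manifest will differ):
#        volumeMounts:
#        - name: es-snapshort
#          mountPath: /usr/share/elasticsearch/snapshort
#      volumes:
#      - name: es-snapshort
#        persistentVolumeClaim:
#          claimName: es-snapshort-pvc      # hypothetical PVC name
kubectl -n elk-logsvc-uat apply -f elklogsvc-data-1.yaml   # file name assumed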
Other notes
While checking the kubelet, orphaned pods turned up in its logs.
Inspecting one of them under /var/lib/kubelet/pods/40ccb5cc-919c-11ea-8800-98039b88740f
showed it was also an ELK pod, presumably one rescheduled onto the node after the old pod was removed: with NFS still stuck, the old pod could not terminate and the new pod could not start, so it hung there. After the fix I deleted all of these pods by hand, so etcd no longer has any record of them and they became orphans.
Hence the kubelet errors. The fix is simply to delete the directory or move it out of the way.
## How to handle orphaned pods
May 09 10:37:25 m725-c114-20-k8s-uat-master02 kubelet[195202]: E0509 10:37:25.068476 195202 kubelet_volumes.go:154] Orphaned pod "40ccb5cc-919c-11ea-8800-98039b88740f" found, but volume paths are still present on disk : There were a total of 1 errors similar to this. Turn up verbosity to see them
# move the orphaned pod directory aside
cd /var/lib/kubelet/pods
mv 40ccb5cc-919c-11ea-8800-98039b88740f /tmp
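Before removing such a directory it is worth confirming that the pod UID really is unknown to the API server (a sketch; the UID is the one from the kubelet log above):
# should print nothing if the pod is truly gone from etcd
kubectl get pods --all-namespaces -o jsonpath='{range .items[*]}{.metadata.uid}{"\n"}{end}' \
  | grep 40ccb5cc-919c-11ea-8800-98039b88740f \
  || echo "not found in the cluster, safe to clean up"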