K3s Study Notes (1): Special Settings for K3s Clusters in Mainland China

Preface

After Docker Hub was blocked, K3s clusters in mainland China were almost paralyzed 🤦. These notes collect the fixes.

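Problems and Solutions

I. Docker Hub image pull failures

1. Problem description

I noticed this while importing a new cluster into Rancher. The new cluster stayed in the Pending state, and the logs showed that a failed image pull was the cause: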

kubectl get pods -n cattle-system
kubectl logs cattle-cluster-agent-xxxxxxxxx-aabbc -n cattle-system

The error messages were:

Error from server (BadRequest): container "cluster-register" in pod "cattle-cluster-agent-xxxxxxxxx-aabbc" is waiting to start: image can't be pulled
Error from server (BadRequest): container "cluster-register" in pod "cattle-cluster-agent-xxxxxxxxx-aabbc" is waiting to start: trying and failing to pull image

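2. Solution

First, find a Docker Hub mirror site: Docker Hub mirror accelerators
Or apply for an Alibaba Cloud image accelerator: Alibaba Cloud Container Registry
You can also host your own, as I did: deploying a Docker image acceleration service
Then edit the configuration file on the affected cluster nodes: /etc/rancher/k3s/registries.yaml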

vi /etc/rancher/k3s/registries.yaml

File contents:

mirrors:
  "docker.io":
    endpoint:
      - "https://registrie.ceshiku.cn"
      - "https://xxxxxxxx.mirror.aliyuncs.com"
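
If the same file has to be written on several nodes, it can be generated by a small script instead of edited by hand. The following is a minimal sketch, not part of K3s: the function name write_registries_yaml and its arguments are hypothetical, and it only covers the docker.io mirror layout shown above.

```shell
#!/bin/sh
# Hypothetical helper (not part of K3s): write a registries.yaml that mirrors
# docker.io through the given endpoints, backing up any existing file first.
# Usage: write_registries_yaml <target-file> <endpoint> [<endpoint>...]
write_registries_yaml() {
  target="$1"
  shift
  if [ "$#" -eq 0 ]; then
    echo "at least one mirror endpoint is required" >&2
    return 1
  fi
  # Keep a copy of the previous configuration, if there is one.
  if [ -f "$target" ]; then
    cp "$target" "$target.bak"
  fi
  # Emit the same structure as the hand-written file above.
  {
    echo 'mirrors:'
    echo '  "docker.io":'
    echo '    endpoint:'
    for endpoint in "$@"; do
      printf '      - "%s"\n' "$endpoint"
    done
  } > "$target"
}
```

On a node it would be called as write_registries_yaml /etc/rancher/k3s/registries.yaml followed by the mirror URLs. K3s still has to be restarted afterwards; on agent nodes the systemd unit is normally k3s-agent rather than k3s.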

Restart the K3s service:

systemctl restart k3s

3. Verify the fix

Check the cluster status again:

kubectl get pods -n cattle-system
kubectl logs cattle-cluster-agent-xxxxxxxxx-aabbc -n cattle-system
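
Besides watching the pods, the mirror can be exercised directly on the node by pulling a small image through the crictl bundled with K3s. This is only a sketch: the function name check_pull, the CRICTL override variable, and the busybox test image are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helper: try to pull one image through the node's container runtime.
# CRICTL can be overridden for testing; by default the crictl bundled with K3s is used.
check_pull() {
  image="${1:-docker.io/library/busybox:latest}"
  # Left unquoted on purpose so that "k3s crictl" splits into command and subcommand.
  if ${CRICTL:-k3s crictl} pull "$image"; then
    echo "pull ok: $image"
  else
    echo "pull failed: $image" >&2
    return 1
  fi
}
```

Running check_pull as root on the node should report success once the mirror configuration is in effect. A failure here points at the registry configuration or the mirror itself rather than at Rancher.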

II. Installing K3s cluster nodes

# Use the custom image mirror as the default image registry
curl -sfL https://rancher-mirror.rancher.cn/k3s/k3s-install.sh | \
  INSTALL_K3S_MIRROR=cn \
  sh -s - \
  --system-default-registry=registrie.ceshiku.cn

III. The Rancher dashboard domain name cannot be resolved

1. Problem description

ERROR: https://rancher.ceshiku.cn/ping is not accessible (Could not resolve host: rancher.ceshiku.cn)

2. Solution

Official documentation: Agent cannot connect to Rancher server
Map the domain name to its IP address manually:

kubectl -n cattle-system patch  deployments cattle-cluster-agent --patch '{
    "spec": {
        "template": {
            "spec": {
                "hostAliases": [
                    {
                      "hostnames":
                      [
                        "rancher.ceshiku.cn"
                      ],
                      "ip": "1.2.3.4"
                    }
                ]
            }
        }
    }
}'

# This step may fail because the DaemonSet does not exist; the error can be ignored
kubectl -n cattle-system patch  daemonsets cattle-node-agent --patch '{
 "spec": {
     "template": {
         "spec": {
             "hostAliases": [
                 {
                    "hostnames":
                      [
                        "rancher.ceshiku.cn"
                      ],
                    "ip": "1.2.3.4"
                 }
             ]
         }
     }
 }
}'
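
Both patches carry the same payload, so the JSON can be generated once from the host name and IP address. The helper below is a minimal sketch; the name host_alias_patch is hypothetical, and it assumes a single host name mapped to a single IP address.

```shell
#!/bin/sh
# Hypothetical helper: print a strategic-merge patch that adds one hostAliases entry.
# Usage: host_alias_patch <hostname> <ip>
host_alias_patch() {
  if [ "$#" -ne 2 ]; then
    echo "usage: host_alias_patch <hostname> <ip>" >&2
    return 1
  fi
  host="$1"
  ip="$2"
  printf '{"spec":{"template":{"spec":{"hostAliases":[{"hostnames":["%s"],"ip":"%s"}]}}}}\n' "$host" "$ip"
}
```

The output can be passed straight to kubectl, for example kubectl -n cattle-system patch deployments cattle-cluster-agent --patch "$(host_alias_patch rancher.ceshiku.cn 1.2.3.4)". To confirm the entry was applied, kubectl -n cattle-system get deployment cattle-cluster-agent -o jsonpath='{.spec.template.spec.hostAliases}' prints the current value.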

IV. CoreDNS stuck in CrashLoopBackOff

1. Problem description

kubectl get pods -A

kube-system     coredns-5fc867466c-rrkqq                0/1     CrashLoopBackOff   36 (3m21s ago)   44m

2. Solution

Before importing the cluster, change the DNS configuration on the Server host:

vi /etc/systemd/resolved.conf

[Resolve]
DNS=8.8.8.8 223.5.5.5

systemctl restart systemd-resolved
sudo mv /etc/resolv.conf /etc/resolv.conf.bak
sudo ln -s /run/systemd/resolve/resolv.conf /etc/

After that, the cluster needs to be reinstalled.
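
These notes do not record why CoreDNS crashed. One commonly reported cause in Kubernetes setups is a forwarding loop when the resolv.conf handed to CoreDNS lists only a loopback resolver such as 127.0.0.53; whether that applied here is an assumption. The sketch below checks a resolv.conf file for that condition. The function name only_loopback_nameservers is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit 0) when every nameserver listed in the
# given resolv.conf-style file is a loopback address (127.x.x.x or ::1).
# Usage: only_loopback_nameservers [file]   (defaults to /etc/resolv.conf)
only_loopback_nameservers() {
  file="${1:-/etc/resolv.conf}"
  servers=$(awk '$1 == "nameserver" { print $2 }' "$file")
  # No nameserver lines at all: nothing to report.
  [ -n "$servers" ] || return 1
  for server in $servers; do
    case "$server" in
      127.*|::1) ;;          # loopback, keep checking
      *) return 1 ;;         # at least one real upstream resolver
    esac
  done
  return 0
}
```

If the check succeeds on /etc/resolv.conf before the change and fails after the symlink is replaced, the host is now exposing real upstream resolvers.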

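V. CPU spikes and the server freezes

The server is most likely short on memory. Try adding virtual memory: creating a swap partition on a KVM-virtualized server to increase virtual memory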
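
Before adding swap, it may be worth confirming that memory really is the bottleneck. The sketch below reads the standard MemAvailable and SwapTotal fields from /proc/meminfo. The function name mem_summary is hypothetical, and what counts as too little memory is left to the reader.

```shell
#!/bin/sh
# Hypothetical helper: summarize available memory and configured swap in MiB.
# Usage: mem_summary [file]   (defaults to /proc/meminfo; values there are in kB)
mem_summary() {
  file="${1:-/proc/meminfo}"
  awk '
    $1 == "MemAvailable:" { avail = $2 }
    $1 == "SwapTotal:"    { swap = $2 }
    END { printf "MemAvailable=%d MiB SwapTotal=%d MiB\n", avail / 1024, swap / 1024 }
  ' "$file"
}
```

A very small MemAvailable combined with SwapTotal=0 while the load is high is consistent with the memory shortage suspected above.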