SRE/DevOps

Chengdu, China

Team Lead / Group Manager

Information technology • DevOps • Ansible • Docker • Kubernetes • Puppet • Bash • Go • Python • Azure SQL • Amazon Redshift • Apache Drill • Cassandra • ClickHouse • Elasticsearch • Google BigQuery • MariaDB • MongoDB • MySQL • OLAP • PostgreSQL • Redis • AWS • Amazon S3 • Azure • Google Cloud • Google App Engine • OpenShift

January 19 at 15:44

Remote work • Part-time
More than 5 years of work experience

Resume file available (protected)

Short link: gkjb.ru/g13Nc

About me

Currently: fighterliudgmail.com.

My skills and experience

Senior DevOps / SRE engineer with extensive experience in cloud-native infrastructure, large-scale Kubernetes operations (500+ clusters), and automation. Specialized in CI/CD, observability, high availability, and heterogeneous (ARM/x86, GPU) infrastructure in air-gapped deployments. Adept at working in fast-paced, cross-functional, and international environments.

1. Automated Deployment & Environment Management (DevOps Toolchain)

Designed and maintained offline Ansible playbooks for system initialization, base environment setup, and infrastructure component deployment.

Implemented scheduled Harbor image synchronization, streamlining release workflows and significantly improving delivery efficiency and system stability.

Integrated Argo CD to automate CD pipelines and generate deployment and test reports.

2. Multi-Architecture Support & Optimization (Heterogeneous Environments)

Led ARM-based hardware selection and platform-level refactoring of business services to improve portability and performance.

Managed service deployment and compatibility across heterogeneous environments (x86 & ARM).

Optimized GPU resource scheduling and sharing, improving overall utilization efficiency.

3. Global Monitoring & Observability

Built a global monitoring architecture for 500+ Kubernetes clusters using Prometheus with Thanos Sidecar.

Developed custom exporters and Grafana dashboards to enhance pre-release health checks and service readiness validation.

Centralized alerting and monitoring data to improve anomaly detection accuracy and incident response efficiency.

Integrated OpenTelemetry and Jaeger for distributed tracing and end-to-end observability.

4. High Availability & Disaster Recovery (HADR Architecture)

Led the redesign of the backend platform's high-availability and disaster recovery architecture.

Improved system stability and availability, significantly reducing service interruption risks.

Introduced middleware rebalancing mechanisms and automated health-check services to enhance self-healing capabilities in disaster scenarios.

5. Unified Platform & Tooling (Kubernetes Orchestration)

Refactored business modules into a Helm Chart Stack, enabling unified orchestration of middleware and databases.

Managed complex Kubernetes environments using Helmfile for configuration standardization and version control.

6. Access Control & Resource Optimization (Security & Resource Management)

Optimized Kubernetes deployment strategies and resource isolation, improving cluster utilization and security posture.

Designed and implemented resource management policies and account/permission optimization schemes.

7. Standardized Operations & Knowledge Base

Established and maintained SOPs covering Ansible playbooks, monitoring configurations, and platform architecture optimizations.

Built a structured operations knowledge base, improving team collaboration and operational efficiency.

8. Achievements & Impact

Increased environment provisioning and release efficiency by 5–8×.

Platform architecture optimizations led to significant improvements in system stability and availability.

Received the Annual Outstanding Employee and Monthly Star awards and earned a promotion.
