Operations and Maintenance Engineer at Best Web3

Job Responsibilities

Responsible for the architecture design, setup, and optimization of the company's server clusters (OCI / AWS).
Manage Linux servers, system environments, user permissions, SSH keys, SFTP, firewalls, and security groups.
Manage Nginx, SSL, reverse proxies, domain names, and certificates to maintain high availability and security.
Maintain virtual machines, load balancers (LB), object storage, VPC/VCN networks, subnets, and security group policies.
Troubleshoot production environment issues: port conflicts, permission errors, service startup failures, full disks, network anomalies, etc.

Design, build, and maintain CI/CD pipelines (GitHub Actions / GitLab CI / Jenkins).
Write and maintain deployment scripts and automated build scripts; manage environment variables and version release processes.
Own deployment and rollback strategies, including blue-green and canary deployments, across testing, UAT, and production environments.
Collaborate with the R&D team for daily releases, emergency fixes, and configuration management.

Establish an application monitoring system (Prometheus, Grafana, ELK, CloudWatch).
Build an alerting system covering CPU, memory, and disk usage, service anomalies, and interface anomalies.
Define and implement SLAs, SLOs, and SLIs to improve system stability.
Perform regular capacity planning, performance optimization, and system load testing.

Manage server accounts, cloud platform accounts, Git repository permissions, and Jira/Wiki system permissions.
Build/maintain bastion hosts (Jump Server/Bastion), adhering to the principle of least privilege.
Write security baseline policies and regularly perform patch upgrades, vulnerability scanning, and security inspections.
Cooperate with the security/risk control team to handle security incidents (brute-force attacks, abnormal traffic, service vulnerabilities, etc.).

Maintain the deployment, backup, and master-slave configuration of services such as MySQL, PostgreSQL, Redis, and Kafka.
Perform database performance tuning, slow SQL analysis, and connection pool optimization.
Implement backup strategies, automatic backups, off-site disaster recovery, and regular recovery drills.

Maintain server inventories, domain certificate inventories, and permission lists.
Write and maintain operations documentation: deployment instructions, deployment processes, security policies, and architecture diagrams.
Manage operations assets: server specifications, monitoring dashboards, keys, environment configurations, and network topology diagrams.

Responsible for the daily management and training of the operation and maintenance team.
Drive the implementation of production change processes, deployment procedures, permission management procedures, and disaster recovery procedures.
Coordinate across teams (R&D, backend, DBA, and security teams) to handle emergency failures.

Job Requirements

Proficient in Linux system administration, Shell scripting, and network basics (Layer 3/Layer 4/Layer 7).
Familiar with cloud platform operation and maintenance: OCI/AWS.
Proficient in Nginx, SSL, reverse proxy, Keepalived, and load balancing.
Familiar with Docker/Kubernetes (proficiency in at least Docker and Docker Compose is required).
Familiar with CI/CD pipelines (GitHub Actions / GitLab CI / Jenkins).
Proficient in MySQL basics, master-slave replication, backup and recovery, and performance optimization.
Familiar with at least one commonly used middleware such as Redis, Kafka, or RabbitMQ.
Experience in building monitoring systems: Prometheus / Grafana / ELK / Loki.

Bonus Points

Strong logical thinking and rapid troubleshooting abilities; able to independently handle production incidents.
A holistic operations mindset covering monitoring, alerting, security, permissions, and processes.
Excellent documentation skills; able to organize asset tables, network topology diagrams, and process documentation.
Strong communication and cross-team collaboration skills.
Operations and maintenance experience in the financial, exchange, or blockchain industries.
Familiar with high-concurrency and high-availability architecture design.