Operation And Maintenance Engineer

Best Web3

7 - 8.5K USD
Full-time
Monitoring Tools (Prometheus, Grafana), Database Management (SQL, MongoDB), Infrastructure as Code (Terraform, Ansible), Cloud Services (AWS, Azure, GCP), Linux, CI/CD, Git, Shell Scripting

Job Responsibilities

I. Infrastructure and Server Operations (Core Responsibilities)

  • Responsible for the architecture design, setup, and optimization of the company's server clusters (OCI / AWS).
  • Manage Linux servers, system environments, user permissions, SSH keys, SFTP, Firewall, and Security Groups.
  • Manage Nginx, reverse proxies, domain names, and SSL certificates, maintaining high availability and security.
  • Maintain virtual machines, load balancers (LB), object storage, VPC/VCN networks, subnets, and security group policies.
  • Troubleshoot production environment issues: port conflicts, permission errors, service startup failures, full disks, network anomalies, etc.

II. CI/CD and Deployment Management

  • Design, build, and maintain CI/CD pipelines (GitHub Actions / GitLab CI / Jenkins).
  • Write and maintain deployment scripts and automated build scripts; manage environment variables and version release processes.
  • Responsible for deployment strategies, rollback strategies, blue-green deployments, and canary deployments in testing/UAT/production environments.
  • Collaborate with the R&D team for daily releases, emergency fixes, and configuration management.

III. System Stability and Availability (SRE Focus)

  • Establish an application monitoring system (Prometheus, Grafana, ELK, CloudWatch).
  • Build an alerting system covering CPU/memory/disk usage, service anomalies, and API anomalies.
  • Responsible for the formulation and implementation of SLAs, SLOs, and SLIs to improve system stability.
  • Perform regular capacity planning, performance optimization, and system load testing.

IV. Security and Access Control

  • Manage server accounts, cloud platform accounts, Git repository permissions, and Jira/Wiki system permissions.
  • Build/maintain bastion hosts (Jump Server/Bastion), adhering to the principle of least privilege.
  • Write security baseline policies and regularly perform patch upgrades, vulnerability scanning, and security inspections.
  • Cooperate with the security/risk control team to handle security incidents (brute-force attacks, abnormal traffic, service vulnerabilities, etc.).

V. Database and Middleware Maintenance

  • Maintain the deployment, backup, and master-slave configuration of services such as MySQL, PostgreSQL, Redis, and Kafka.
  • Perform database performance tuning, slow-query analysis, and connection pool optimization.
  • Implement backup strategies, automatic backups, off-site disaster recovery, and regular recovery drills.

VI. Documentation and Asset Management

  • Maintain server inventories, domain certificate inventories, and permission lists.
  • Write and maintain operation and maintenance documentation: deployment instructions, deployment processes, security policies, and architecture diagrams.
  • Manage operation and maintenance assets: server specifications, monitoring dashboards, keys, environment configurations, and network topology diagrams.

VII. Team and Process Development

  • Responsible for the daily management and training of the operation and maintenance team.
  • Drive the implementation of production change processes, deployment procedures, permission management procedures, and disaster recovery procedures.
  • Coordinate across teams (R&D, backend, DBA, and security teams) to handle emergency failures.

Job Requirements

  • Proficient in Linux system administration, Shell scripting, and networking fundamentals (Layer 3/Layer 4/Layer 7).
  • Familiar with cloud platform operation and maintenance: OCI/AWS.
  • Proficient in Nginx, SSL, reverse proxy, Keepalived, and load balancing.
  • Familiar with Docker/Kubernetes (proficiency in at least Docker and Docker Compose is required).
  • Familiar with CI/CD pipelines (GitHub Actions / GitLab CI / Jenkins).
  • Proficient in MySQL basics, master-slave replication, backup and recovery, and performance optimization.
  • Familiar with at least one commonly used middleware such as Redis, Kafka, or RabbitMQ.
  • Experience in building monitoring systems: Prometheus / Grafana / ELK / Loki.
  • Bonus points: Strong logical thinking and rapid troubleshooting abilities; able to independently handle production incidents.
  • A holistic operations mindset covering monitoring, alerting, security, permissions, and processes.
  • Excellent documentation skills; able to organize asset tables, network topology diagrams, and process documentation.
  • Strong communication and cross-team collaboration skills.
  • Experience in operations and maintenance in the financial, exchange, or blockchain industries.
  • Familiar with high-concurrency and high-availability architecture design.