DevOps Engineer at BB Wave Inc.

The DevOps Engineer is responsible for maintaining, monitoring, automating, and optimizing the company's infrastructure, applications, and deployment processes. The role ensures system reliability, availability, scalability, and security through proactive monitoring, incident response, automation, and continuous improvement of operational workflows.

I. Responsibilities:

1. Monitoring and System Inspection

• Continuously monitor metrics and alerts through platforms such as Prometheus, Grafana, and Huawei Cloud Monitoring.

• Track key service indicators, including QPS (Queries Per Second), error rates, response times, and success rates, as well as the health of infrastructure resources such as CPU, memory, disk, and network utilization.

• Conduct routine inspections of production systems, data center environments, and network connectivity, and prepare inspection reports in accordance with established procedures.

2. Alert Management and Incident Response

• Respond promptly to alerts received through phone calls, SMS, Microsoft Teams, and other communication channels, following established Standard Operating Procedures (SOPs).

• Classify alerts according to severity levels and execute the corresponding response plans.

• Independently resolve incidents when possible, for example by restarting services, clearing disk space, scaling resources, or switching traffic.

• Escalate unresolved issues to the Development Team in line with the escalation process, track progress until resolution, and document all actions taken.

• Continuously optimize alerting rules to minimize false positives, missed alerts, and alert storms.
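
As an illustration of the severity-based triage described above, the following is a minimal Python sketch. The severity names (P1 to P3), the thresholds, and the response plans are hypothetical and are not taken from the company's SOPs.

```python
from dataclasses import dataclass

# Hypothetical severity levels and response plans; real values come from the SOPs.
RESPONSE_PLANS = {
    "P1": "page on-call engineer and open incident channel",
    "P2": "create ticket and notify service owner",
    "P3": "log for review in the next inspection report",
}


@dataclass
class Alert:
    service: str
    error_rate: float  # fraction of failed requests, 0.0 to 1.0
    disk_used: float   # fraction of disk in use, 0.0 to 1.0


def classify(alert: Alert) -> str:
    """Map an alert to a severity level using illustrative thresholds."""
    if alert.error_rate >= 0.05 or alert.disk_used >= 0.95:
        return "P1"
    if alert.error_rate >= 0.01 or alert.disk_used >= 0.85:
        return "P2"
    return "P3"


def respond(alert: Alert) -> str:
    """Return the response plan that matches the alert's severity."""
    return RESPONSE_PLANS[classify(alert)]
```

For example, `respond(Alert("api", 0.10, 0.50))` would select the P1 plan under these assumed thresholds.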

3. Daily Operations and Process Management

• Handle operational requests submitted through the ticketing system, including:

Account provisioning
Access and permission requests
Resource allocation requests
Firewall policy modifications
Domain and SSL certificate applications

• Support deployment activities, including application releases, rollbacks, and configuration changes, following standardized procedures.

• Assist with routine operational tasks such as backup verification, slow query analysis, and vulnerability scan follow-ups.

• Maintain and update CMDB (Configuration Management Database) records, ensuring accurate information on servers, IP addresses, applications, and responsible personnel.

4. Incident Management and Post-Mortem Analysis

• Serve as the first responder during system incidents by coordinating communication channels, notifying stakeholders, managing resources, and providing timely updates.

• Participate in incident post-mortem reviews by documenting timelines, root causes, and improvement actions.

• Contribute to the continuous improvement of SOPs and the internal knowledge base.

5. Documentation and Knowledge Management

• Develop and maintain workflow manuals, emergency response procedures, and SOP documentation.

• Prepare weekly and monthly operational reports, including:

Top alert statistics
Incident frequency
Resolution times
Service performance metrics
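
As an illustration of how such report figures might be produced, the following is a minimal Python sketch. The incident records, alert names, and field layout are hypothetical; in practice the data would come from the ticketing or alerting system.

```python
from collections import Counter
from datetime import datetime
from statistics import mean

# Hypothetical incident records: (alert name, opened at, resolved at).
INCIDENTS = [
    ("disk_usage_high", datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 9, 30)),
    ("api_error_rate", datetime(2024, 1, 2, 14, 0), datetime(2024, 1, 2, 15, 0)),
    ("disk_usage_high", datetime(2024, 1, 3, 8, 0), datetime(2024, 1, 3, 8, 15)),
]


def summarize(incidents, top_n=3):
    """Compute incident count, most frequent alerts, and mean resolution time."""
    counts = Counter(name for name, _, _ in incidents)
    minutes = [
        (resolved - opened).total_seconds() / 60
        for _, opened, resolved in incidents
    ]
    return {
        "incident_count": len(incidents),
        "top_alerts": counts.most_common(top_n),
        "mean_resolution_minutes": mean(minutes),
    }


if __name__ == "__main__":
    print(summarize(INCIDENTS))
```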

6. Perform other tasks or responsibilities as assigned by Management and the Department Head.

II. Job Requirements:

Education and Experience:

• Associate's Degree, Bachelor's Degree, or higher in Computer Science, Information Technology, Network Engineering, Telecommunications, or a related field.

• 1–3 years of experience in IT Operations, System Administration, or DevOps. Outstanding fresh graduates are encouraged to apply.

Technical Skills:

• Proficient in Linux administration and commonly used commands, with the ability to:

 Analyze logs

 Troubleshoot processes

 Perform network connectivity testing

 Diagnose disk and memory-related issues

• Good understanding of TCP/IP, HTTP, DNS, and Load Balancing concepts.

• Ability to troubleshoot with, and interpret the output of, tools such as:

 ping

 telnet

 curl

 tcpdump

• Experience with at least one monitoring or log aggregation platform:

 Zabbix

 Prometheus + Grafana

 Loki

 Huawei Cloud Monitoring

• Familiarity with the operation and maintenance of common web services and middleware, including:

 Nginx

 Tomcat

 Redis

 MySQL

• Strong command of Linux utilities such as:

 grep

 awk

 systemctl (systemd)

 netstat

 df

 top

• Ability to develop Shell or Python scripts for automation and routine operational tasks.

• Familiarity with Kubernetes ecosystem tools and configurations, including:

 Helm

 Operators

 Istio

 Grafana

 Prometheus

 Basic YAML configuration

• Experience with automation and CI/CD tools such as:

 Ansible

 Jenkins

 ArgoCD

• Ability to trigger pipelines, review logs, and troubleshoot build failures.

• Proficient in using kubectl commands, including:

 get

 describe

 logs

 exec

• Solid understanding of Kubernetes concepts, including:

 Pods

 Services

 Deployments

 StatefulSets

 Persistent Volumes (PV)

 Persistent Volume Claims (PVC)

Soft Skills:

• Excellent communication and interpersonal skills.

• Ability to remain calm and organized during incidents and to communicate updates effectively.

• Strong sense of ownership, accountability, and teamwork.

• Ability to clearly explain technical issues and coordinate with cross-functional teams.

III. Preferred Qualifications:

• Familiarity with IT Service Management processes, including:

 Incident Management

 Problem Management

 Change Management

 Configuration Management

• Experience with ticketing systems such as:

 Jira Service Management

 ServiceNow

 ONES

 Proprietary ticketing platforms

• Hands-on experience with Docker and Kubernetes operations, including:

 Checking Pod status

 Reviewing logs

 Restarting services

• Experience supporting large-scale business environments such as:

 E-commerce

 Financial Services

 Gaming

 Live Streaming Platforms

• Ability to develop automation tools using Python or Go to streamline repetitive operational tasks.

IV. DevOps Work Environment and Characteristics:

• This is not a purely reactive monitoring role. Engineers are encouraged to proactively identify issues, optimize processes, and improve system reliability.

• A structured escalation and incident management process is in place to ensure efficient issue resolution and accountability.

• The organization provides comprehensive operational tools, including monitoring, logging, alerting, and management platforms, to support daily operations and incident response.