DevOps Engineer
BB Wave Inc.
The DevOps Engineer is responsible for maintaining, monitoring, automating, and optimizing the company's infrastructure, applications, and deployment processes. The role ensures system reliability, availability, scalability, and security through proactive monitoring, incident response, automation, and continuous improvement of operational workflows.
I. Responsibilities:
1. Monitoring and System Inspection
• Continuously monitor metrics and alerts through platforms such as Prometheus, Grafana, and Huawei Cloud Monitoring.
• Track key business indicators, including QPS (Queries Per Second), error rates, response times, and success rates, as well as the health status of infrastructure resources such as CPU, memory, disk, and network utilization.
• Conduct routine inspections of production systems, data center environments, and network connectivity, and prepare inspection reports in accordance with established procedures.
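As an illustration of how the indicators in this section relate to one another, the following minimal Python sketch derives QPS, error rate, and success rate from raw request counts over a time window. The function name `summarize_window`, its parameters, and the sample numbers are hypothetical and are not tied to any specific monitoring platform.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    qps: float
    error_rate: float
    success_rate: float


def summarize_window(total_requests: int, failed_requests: int, window_seconds: float) -> WindowStats:
    """Derive basic service indicators from request counts in one time window."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    if not 0 <= failed_requests <= total_requests:
        raise ValueError("failed_requests must be between 0 and total_requests")
    qps = total_requests / window_seconds
    # With no traffic there is nothing to fail, so report a 0% error rate.
    error_rate = failed_requests / total_requests if total_requests else 0.0
    return WindowStats(qps=qps, error_rate=error_rate, success_rate=1.0 - error_rate)


if __name__ == "__main__":
    # Hypothetical sample: 6000 requests, 30 failures, in a 60-second window.
    stats = summarize_window(total_requests=6000, failed_requests=30, window_seconds=60)
    print(f"QPS={stats.qps:.1f} error_rate={stats.error_rate:.2%} success_rate={stats.success_rate:.2%}")
```

In practice these values would normally come from the monitoring platform's own queries; the sketch only shows the arithmetic behind them.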
2. Alert Management and Incident Response
• Respond promptly to alerts received through phone calls, SMS, Microsoft Teams, and other communication channels, following established Standard Operating Procedures (SOPs).
• Classify alerts according to severity levels and execute the corresponding response plans.
• Independently resolve incidents when possible, for example by restarting services, clearing disk space, scaling resources, or switching traffic.
• Escalate unresolved issues to the Development Team through the established escalation process, track progress until resolution, and document all actions taken.
• Continuously optimize alerting rules to minimize false positives, missed alerts, and alert storms.
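Severity classification of the kind described in this section can be sketched as a simple rule lookup. The Python example below is a hypothetical illustration: the severity levels (P1 to P3), the thresholds, and the response actions are assumptions made for demonstration and do not represent the company's actual SOPs.

```python
from enum import Enum


class Severity(Enum):
    P1 = "critical"
    P2 = "major"
    P3 = "minor"


# Hypothetical response plans; a real SOP would define these.
RESPONSE_PLAN = {
    Severity.P1: "page on-call engineer and open an incident channel",
    Severity.P2: "notify on-call engineer via chat and create a ticket",
    Severity.P3: "create a ticket for the next business day",
}


def classify_alert(error_rate: float, service_down: bool) -> Severity:
    """Map alert attributes to a severity level using assumed thresholds."""
    if service_down or error_rate >= 0.05:
        return Severity.P1
    if error_rate >= 0.01:
        return Severity.P2
    return Severity.P3


def response_for(error_rate: float, service_down: bool = False) -> str:
    """Return the severity name and the matching response plan as one line."""
    severity = classify_alert(error_rate, service_down)
    return f"{severity.name}: {RESPONSE_PLAN[severity]}"


if __name__ == "__main__":
    print(response_for(0.002))
    print(response_for(0.02))
    print(response_for(0.0, service_down=True))
```

Keeping the thresholds and response plans in one place makes them easier to review when alerting rules are tuned.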
3. Daily Operations and Process Management
• Handle operational requests submitted through the ticketing system, including:
- Account provisioning
- Access and permission requests
- Resource allocation requests
- Firewall policy modifications
- Domain and SSL certificate applications
• Support deployment activities, including application releases, rollbacks, and configuration changes following standardized procedures.
• Assist with routine operational tasks such as backup verification, slow query analysis, and vulnerability scan follow-ups.
• Maintain and update CMDB (Configuration Management Database) records, ensuring accurate information on servers, IP addresses, applications, and responsible personnel.
4. Incident Management and Post-Mortem Analysis
• Serve as the first responder during system incidents by coordinating communication channels, notifying stakeholders, managing resources, and providing timely updates.
• Participate in incident post-mortem reviews by documenting timelines, root causes, and improvement actions.
• Contribute to the continuous improvement of SOPs and the internal knowledge base.
5. Documentation and Knowledge Management
• Develop and maintain workflow manuals, emergency response procedures, and SOP documentation.
• Prepare weekly and monthly operational reports, including:
- Top alert statistics
- Incident frequency
- Resolution times
- Service performance metrics
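The report items listed above can be computed from incident records with a few lines of Python. This is a minimal sketch under stated assumptions: the record layout (alert name, minutes to resolve), the sample data, and the function names `top_alerts` and `mean_resolution_minutes` are all hypothetical.

```python
from collections import Counter
from statistics import mean

# Hypothetical incident records: (alert name, minutes to resolve).
incidents = [
    ("disk_usage_high", 12),
    ("api_5xx_rate", 45),
    ("disk_usage_high", 8),
    ("redis_latency", 30),
    ("disk_usage_high", 10),
]


def top_alerts(records, n=3):
    """Return the n most frequent alert names with their counts."""
    return Counter(name for name, _ in records).most_common(n)


def mean_resolution_minutes(records):
    """Average time to resolve across all incidents, in minutes."""
    return mean(minutes for _, minutes in records)


if __name__ == "__main__":
    print("Top alerts:", top_alerts(incidents, n=2))
    print("Incident count:", len(incidents))
    print("Mean resolution (min):", mean_resolution_minutes(incidents))
```

A real report would more likely pull these records from the ticketing or alerting system than from a hard-coded list.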
6. Perform other tasks or responsibilities as assigned by Management or the Department Head.
II. Job Requirements:
Education and Experience:
• Associate's Degree, Bachelor's Degree, or higher in Computer Science, Information Technology, Network Engineering, Telecommunications, or a related field.
• 1–3 years of experience in IT Operations, System Administration, or DevOps. Outstanding fresh graduates are encouraged to apply.
Technical Skills:
• Proficient in Linux administration and commonly used commands, with the ability to:
- Analyze logs
- Troubleshoot processes
- Perform network connectivity testing
- Diagnose disk and memory-related issues
• Good understanding of TCP/IP, HTTP, DNS, and Load Balancing concepts.
• Ability to troubleshoot with, and interpret the output of, tools such as:
- ping
- telnet
- curl
- tcpdump
• Experience with at least one monitoring or log aggregation platform:
- Zabbix
- Prometheus + Grafana
- Loki
- Huawei Cloud Monitoring
• Familiarity with the operation and maintenance of common web services and middleware, including:
- Nginx
- Tomcat
- Redis
- MySQL
• Strong command of Linux utilities and system tools such as:
- grep
- awk
- systemd
- netstat
- df
- top
• Ability to develop Shell or Python scripts for automation and routine operational tasks.
• Familiarity with Kubernetes ecosystem tools and configurations, including:
- Helm
- Operators
- Istio
- Grafana
- Prometheus
- Basic YAML configuration
• Experience with automation and CI/CD tools such as:
- Ansible
- Jenkins
- ArgoCD
• Ability to trigger pipelines, review logs, and troubleshoot build failures.
• Proficient in using kubectl commands, including:
- get
- describe
- logs
- exec
• Solid understanding of Kubernetes concepts, including:
- Pods
- Services
- Deployments
- StatefulSets
- Persistent Volumes (PV)
- Persistent Volume Claims (PVC)
Soft Skills:
• Excellent communication and interpersonal skills.
• Ability to remain calm and organized during incidents and effectively communicate updates.
• Strong sense of ownership, accountability, and teamwork.
• Ability to clearly explain technical issues and coordinate with cross-functional teams.
III. Preferred Qualifications:
• Familiarity with IT Service Management processes, including:
- Incident Management
- Problem Management
- Change Management
- Configuration Management
• Experience with ticketing systems such as:
- Jira Service Management
- ServiceNow
- ONES
- Proprietary ticketing platforms
• Hands-on experience with Docker and Kubernetes operations, including:
- Checking Pod status
- Reviewing logs
- Restarting services
• Experience supporting large-scale business environments such as:
- E-commerce
- Financial Services
- Gaming
- Live Streaming Platforms
• Ability to develop automation tools using Python or Go to streamline repetitive operational tasks.
IV. DevOps Work Environment and Characteristics:
• This is not a purely reactive monitoring role. Engineers are encouraged to proactively identify issues, optimize processes, and improve system reliability.
• A structured escalation and incident management process is in place to ensure efficient issue resolution and accountability.
• The organization provides comprehensive operational tools, including monitoring, logging, alerting, and management platforms, to support daily operations and incident response.