Site Reliability Engineer (SRE)
With about 200 employees and offices in China and Hanoi, OpenCommerce Group is a leading technology organization that offers e-commerce products. With ten years in the e-commerce industry and over one million customers in the online sales sector around the world, our team is fortunate to be able to target vast and vibrant markets in most nations, such as the United States and China. The aim of OpenCommerce Group is to create a product ecosystem that facilitates and improves e-commerce in general, and cross-border commerce in particular, and to serve as a launching pad for entrepreneurs to start, expand, and perform in the online world. We're expanding rapidly, and we're searching for top talent to help us create a global Commerce Community. Join OpenCommerce Group to expand the scope of your work.
JOB DESCRIPTION
- Manage and improve system reliability through SLO, SLI, and SLA practices.
- Design and implement observability systems (metrics, logs, tracing, alerting) using tools like Prometheus, Grafana, ELK, etc.
- Build and automate CI/CD pipelines and Infrastructure as Code (IaC) using tools such as Terraform, Ansible, Pulumi, and Helm.
- Collaborate in the analysis, design, and deployment of systems and processes to ensure reliability, observability, and scalability.
- Optimize system cost, performance (latency, throughput), and security.
- Operate and optimize Kubernetes clusters (EKS); strong knowledge of Docker, Kubernetes, and Helm is required.
- Develop internal tools to automate workflows and support other teams.
- Participate in incident response, root cause analysis, and postmortem reviews, and improve incident handling processes.
- Support and coordinate with NOC (Network Operations Center) teams.
- Be part of the on-call rotation when needed.
REQUIREMENTS
- 2–5 years of experience in SRE / DevOps / Platform Engineering.
- Hands-on experience with monitoring and alerting systems (Prometheus, Grafana, ELK, Loki, etc.).
- Proficient in CI/CD tools (GitLab CI, Jenkins) and familiar with Git workflows.
- Experience in deploying and managing Kubernetes (EKS is a plus).
- Understanding of gRPC, and ability to optimize nginx connections and network stacks.
- Strong Linux background with deep knowledge of kernel, network stack, file system, and processes.
- Excellent troubleshooting skills — able to analyze issues from OS to application layer.
- System-thinking mindset, focus on automation, and ability to mentor teammates.
- Proactive, responsible, and able to work under pressure during incident response.
NICE TO HAVE
- Experience with AWS (EKS, EC2, RDS, CloudWatch).
- Strong understanding of networking concepts (TCP/IP, DNS, Load Balancing, CDN).
- Experience with high availability and distributed systems.
- Previously built a complete observability stack.
- Experience in building or optimizing Golang SDKs or internal frameworks.
- Knowledge of cloud-native networking (CNI, overlay, BGP, eBPF-based load balancing).
BENEFITS
- Competitive monthly NET salary, transparent and fully take-home
- Up to 16 months’ salary per year, including a 13th-month salary, quarterly incentives, and annual performance bonuses
- 24 remote working days per year, enabling a healthy work–life balance
- 12 days of paid annual leave, in addition to public holidays
- Flexible working hours, Monday to Friday – weekends are fully yours
- Annual health check-ups
- Full social insurance coverage (BHXH) in compliance with Vietnamese labor regulations
- Company-sponsored sports clubs to support both physical and mental well-being
- Regular company trips and team bonding activities
- Be part of a fast-growing global B2B SaaS organization
- Clear and accelerated career development and promotion pathways
- Collaborate with talented, diverse, and high-performing teams across regions
- Work in a modern, open, and empowering environment where individuality is respected and potential is nurtured