Site Reliability Engineer (SRE)

Full Time

Ha Noi

Opening

With about 200 employees and offices in China and Hanoi, OpenCommerce Group is a leading technology organization that offers e-commerce products. With ten years in the e-commerce industry and over one million online sales customers in most countries around the world, including the United States and China, our team is well placed to target vast and vibrant markets. OpenCommerce Group aims to create a product ecosystem that facilitates and improves e-commerce in general and cross-border commerce in particular, and to serve as a launching pad for entrepreneurs to start, expand, and perform in the online world. We're expanding rapidly and searching for top talent to help us create a global Commerce Community. Join OpenCommerce Group to expand the scope of your work.

JOB DESCRIPTION

Manage and improve system reliability through SLO, SLI, and SLA practices.
Design and implement observability systems (metrics, logs, tracing, alerting) using tools like Prometheus, Grafana, ELK, etc.
Build and automate CI/CD pipelines and Infrastructure as Code (IaC) using tools such as Terraform, Ansible, Pulumi, and Helm.
Collaborate in the analysis, design, and deployment of systems and processes to ensure reliability, observability, and scalability.
Optimize system cost, performance (latency, throughput), and security.
Operate and optimize Kubernetes clusters (EKS); strong knowledge of Docker, Kubernetes, and Helm is required.
Develop internal tools to automate workflows and support other teams.
Participate in incident response, root cause analysis, and postmortem reviews, and improve incident handling processes.
Support and coordinate with NOC (Network Operations Center) teams.
Be part of the on-call rotation when needed.

REQUIREMENTS

2–5 years of experience in SRE / DevOps / Platform Engineering.
Hands-on experience with monitoring and alerting systems (Prometheus, Grafana, ELK, Loki, etc.).
Proficient in CI/CD tools (GitLab CI, Jenkins) and familiar with Git workflows.
Experience in deploying and managing Kubernetes (EKS is a plus).
Understanding of gRPC and the ability to optimize nginx connections and network stacks.
Strong Linux background with deep knowledge of kernel, network stack, file system, and processes.
Excellent troubleshooting skills, with the ability to analyze issues from the OS to the application layer.
Systems-thinking mindset, a focus on automation, and the ability to mentor teammates.
Proactive, responsible, and able to work under pressure during incident response.

Nice to Have

Experience with AWS (EKS, EC2, RDS, CloudWatch).
Strong understanding of networking concepts (TCP/IP, DNS, Load Balancing, CDN).
Experience with high availability and distributed systems.
Previously built a complete observability stack.
Experience in building or optimizing Golang SDKs or internal frameworks.
Knowledge of cloud-native networking (CNI, overlay, BGP, eBPF-based load balancing).

BENEFITS

You'll find this place irresistible

Enjoy top-tier compensation, including:

Monthly NET take-home pay that leaves you smiling
13th-month salary
Performance bonuses that could boost your income by up to 2 months' salary
24 remote working days per year
12 days of annual paid leave
Flexible working time, from Monday to Friday; weekends are yours
Company trips and team bonding activities
Elevate your creativity and productivity in our modern workspace
Especially:
Shine like a rock star in our fast-growing global B2B SaaS squad
Blaze a trail to success with our super-fast career track
Collaborate with the brightest and coolest minds from across the globe
Be yourself, knowing you're valued and groomed to be your absolute best
