CloudOps Engineer
The CloudOps Engineer will support solutions hosted on AWS, including Linux and Windows servers running on EC2. They will be responsible for all aspects of production maintenance and administration, including but not limited to: infrastructure automation, continuous integration and deployment, product release and support, running a scalable production environment to host the ARCOS platform, maintaining application and database availability, and ensuring 24x7 uptime of our production services.
This is a remote role, and you should be based in Poland.
What you'll do:
- Design, develop and maintain scalable AWS solutions and infrastructure, including but not limited to: EC2, RDS, S3, DynamoDB, ElastiCache, and Route 53.
- Develop tooling and processes to automate the deployment of SaaS-based applications and their underlying operating systems and infrastructure.
- Perform PostgreSQL and Oracle database administration, including maintenance, troubleshooting, tuning, optimization, installation, upgrades, backup/recovery, and data migration.
- Partner with Engineering, Development, Quality Assurance, Professional Services, and Technical Support to ensure the success of the assigned product offerings and schedules.
- Engage in Agile team practices such as daily standups, backlog refinement, release planning and sprint planning.
- Coordinate configuration changes, installs, and upgrades with appropriate development teams and product owners while following company change control procedures.
- Participate in 24x7 on-call responsibilities, maintaining the availability and performance of all customer-facing production services.
- Triage and participate in the resolution of complex problems, including network connectivity issues, that span multiple tiers of application/infrastructure.
- Actively monitor supported systems and respond promptly to security or usability concerns.
- Review application logs and analyze events using cloud-native services (e.g., CloudWatch, CloudTrail) or third-party SIEM tools (e.g., Splunk).
- Upgrade systems and processes as required for enhanced functionality and security compliance.
- Accurately document all processes and procedures for routine and non-routine tasks.
- Perform other duties and responsibilities as assigned.
What you'll need:
- Bachelor's degree in Computer Science or related field, or equivalent work experience.
- 4-5 years of system administration experience, ideally in global management and operations of highly trafficked production applications. Experience working in a 24x7 SaaS environment is preferred.
- 4-5 years of experience designing solutions for and managing AWS services, including but not limited to: EC2, RDS, S3, DynamoDB, ElastiCache, WAF/Shield, Route 53, IAM and Directory Service, ECS, EKS, ECR, DNS, Parameter Store, and Application Load Balancer (ALB).
- 2 years of experience with CI/CD technologies and best practices, using tools such as AWS CodePipeline, CodeBuild, GitHub Actions, or Bitbucket Pipelines.
- 2 years of experience with PostgreSQL, Oracle, and SQL Server.
- Experience hosting and supporting Esri ArcGIS Server and FME data integration tools is a plus.
- Experience with Linux and Windows system administration, automation and performance tuning.
- Experience with configuration management and infrastructure as code tools such as Ansible and Terraform.
- Experience with Apache, Nginx, Tomcat, and Node.js/PM2.
- Experience with scripting languages, including Bash, Python and PowerShell.
- Knowledge of Kubernetes, Docker, Jira, Confluence.
- Advanced knowledge of system vulnerability management and security best practices.
- Solid understanding of observability, networking concepts and troubleshooting.
- Proven ability to work effectively with highly reliable, highly available, mission-critical technologies, with strong attention to detail and a track record of delivering results on deadline.
- Ability to manage deployment automation, SaaS operations, internal and external SaaS infrastructure, security, and costs.
- Solid understanding of technical issues and opportunities related to modern cloud infrastructure and operations.
- Excellent written and verbal communication skills.
Production Support/On-Call Duties:
As a key member of our engineering team, you will handle production issues escalated from customer support. Your responsibilities will include:
- Participating in a rotational on-call schedule to handle significant production issues.
- Rapidly diagnosing and resolving technical challenges that arise in production.
- Collaborating with customer support and engineering teams for seamless issue resolution.
- Maintaining clear communication and documentation during and after incidents.
- Applying lessons learned from incidents to drive continuous process improvement.