Site Reliability Engineer (Boston or Remote)
Full-time Cogito Remote, Massachusetts - MA
You have deep experience running Kubernetes infrastructure for production SaaS systems on a global scale. You have organized a team to provide responsive on-call support for both external and internal customers, and you have maintained and upgraded large-scale production systems and cloud infrastructure to meet strict security requirements while maintaining operational SLOs and customer SLAs.
You have experience in platform migrations and have successfully moved large production workloads from legacy systems to a Kubernetes-based containerized microservices architecture. You possess very strong troubleshooting skills and have led a team of SREs/operations engineers in troubleshooting, system maintenance, and software updates in a cloud-native Kubernetes environment.
You have transformed the operations model to enable continuous deployments and high developer productivity while the company experienced hypergrowth and international expansion. You are well organized and thrive in fast-paced environments where priorities are set based on business needs. Conversant with a large variety of subjects, you have the ability to triage and manage a broad range of issues. You have led multiple projects to successful completion and deployment to production.
- Delight internal and external customers with responsive, well-organized on-call SRE support and with high-performing, well-maintained tools, systems, and Kubernetes-based production infrastructure.
- Provide timely resolution to customer concerns and issues. Troubleshoot software and infrastructure issues as needed.
- Maintain PCI, HITRUST, HIPAA and SOC2 status by maintaining the tools and systems you are responsible for, keeping software updated, and supporting our security team during security audits.
- Continuously improve the reliability and cost efficiency of our services and infrastructure.
- Develop and drive the SRE engagement model, conduct production readiness reviews, and improve our operational processes to enable company growth and international expansion.
- Automate processes and practices to manage cloud infrastructure lifecycle and configurations to client specifications.
- Design and architect technical solutions to meet customer requirements and communicate them to a broad range of stakeholders within the business.
- Bachelor’s degree in a CS/IS/IT/System Administration related field or equivalent experience
- 3+ years in a DevOps, Site Reliability Engineer or equivalent role
- Willingness to learn new technologies and skills on the fly
- Demonstrated history of working in environments with any of the following compliance standards: PCI, HITRUST, HIPAA, Sarbanes-Oxley, ISO 27002, CIS L1 & L2
- Extensive experience with production Kubernetes clusters and related tooling (Service Mesh, Ingress Controller, Operators). This is a critical requirement to be successful in this role.
- Extensive experience with a public cloud provider (AWS is preferred).
- Proficiency in programming/scripting languages (e.g. Bash, Python, Go, Java) to automate repeatable processes and develop/enhance microservice-based systems
- Experience with configuration management tools (e.g. Ansible, Chef, Puppet)
- Experience in building “Infrastructure as code” (e.g. Terraform, CloudFormation)
- Deep understanding of Linux
- Extensive experience with a CI/CD tool such as Jenkins or Travis CI
- Extensive experience in troubleshooting and debugging application related network and infrastructure issues
- Experience in production SaaS environments
- Excellent communication and documentation skills
- Experience working in a company that practices both Agile and DevOps
Compensation & Benefits
- Your choice of comprehensive benefits for you and your dependents, effective on date of hire: health, dental, vision, flexible spending, life insurance, disability, additional voluntary supplemental life insurance
- Pet Insurance
- Employee Assistance Programs (EAP)
- 20 days vacation time, 5 days sick time, 2 floating holidays and 11 company holidays
- 2 "Be Gentle" personal days
- 401(k) retirement plan options
- Competitive pay and bonus eligibility
- Stock options via equity grants
- Ongoing professional development and cross-training
- Company paid parental leave upon hire
- Office Optional policy where Cogicians choose where they work: primarily remote, primarily in office, or hybrid
- Ability to support Cogicians anywhere in the US through our Office Optional policy
- Employee Referral Bonus Program
- Employee Resource Groups
Cogito is searching for a Site Reliability Engineer to join our organization. The ideal candidate has a mix of customer-facing skills, strong operational production support experience and systems know-how.