Site Reliability Engineer

About Us

We are a passionate team of open source developers with a desire to build a successful and sustainable business that can impact the world at large. Our mission is to create open source, enterprise-grade products that help individuals and organizations unlock their potential and become top performers in their respective domains. To achieve this, we are building a suite of tools that spans the entire web development lifecycle, ranging from a best-in-class local development experience all the way through multi-cloud, high-availability hosting (PaaS or self-hosted). To learn more, please visit https://www.drud.com/, our GitHub (https://github.com/drud/), and governance (https://github.com/drud/community) pages.

Roles and Responsibilities

  • Be professional, courteous, kind and responsive to others you engage with.
  • Integrate with a fast-paced engineering team to design, develop and deliver our local development and hosting products.
  • Help maintain 24×7 uptime on public cloud-based infrastructure.
  • Be a first responder during outages for clients with managed hosting and self-hosting with a support package.
  • Help design, build, and maintain solutions around logging, networking, monitoring, security, disaster recovery, etc.

Requirements

An overall team-centric philosophy and strong emotional intelligence are absolutely a must. Through Project Aristotle, Google spent a tremendous amount of effort to discover the keys to high-performing teams, and we feel that we have a lot to gain by standing on the shoulders of giants when building out our team. We have a strong affinity for cloud-native technologies and so should you. You must love highly distributed, mission-critical computing using modern technologies and languages.

Qualifications

  • Experience managing production Kubernetes clusters.
  • Must be fluent in at least one programming language such as Python, Go or Ruby.
  • 3+ years in a combination of DevOps, SRE, or Systems Operations roles.
  • 3+ years of experience managing Linux-based servers. CoreOS is a big plus.
  • Demonstrated understanding of containers and container orchestration.
  • Troubleshooting skills that span systems, network (TCP/IP), and code.
  • Must have experience building or managing large-scale systems and application architectures.
  • Solid understanding of system performance and monitoring.
  • Working knowledge of cloud computing including virtualization, hosted services, multi-tenant cloud infrastructures, distributed storage systems and content delivery networks.
  • Experience working with source control management tools; GitHub is a huge plus.
  • Excellent verbal and written communication skills.

Nice to Haves

  • Production experience with federated Kubernetes clusters.
  • Experience with service meshes such as Istio or Linkerd.
  • Experience with multiple large cloud hosting providers: AWS, GCP, and Azure.
  • Experience with load balancers such as Elastic Load Balancer, NGINX, Envoy, HAProxy or Google Cloud Load Balancer.
  • Experience with messaging technologies: Kafka, RabbitMQ, NATS.
  • Experience with infrastructure configuration and automation processes and tools: Ansible, Fabric, Terraform, Puppet, Chef.
  • Experience with monitoring solutions: Prometheus, ELK, Splunk, Sumo Logic, Nagios or Fluentd.
  • Experience with various data technologies including relational and nonrelational databases and message queues.
  • Experience with distributed storage systems: Ceph, GlusterFS, EFS, EBS or Rook.

Benefits

  • Flexible vacation/time-off.
  • Competitive salaries and performance-based raises.
  • Health, vision and dental insurance.
  • Professional development opportunities.
  • An amazing team of like-minded individuals to create with.

Applications (including a resume, a cover letter, and any additional information that would be relevant to the position) can be sent to careers@drud.com.
