The Control Group

Sr Site Reliability Engineer (SRE DevOps)

Reliability Engineering – San Diego, California
Department: Reliability Engineering
Employment Type: Full-Time Active
Minimum Experience: Experienced


The Control Group (“TCG”) is entering a new and exciting phase in 2018, and we are looking for a talented, collaborative Senior Site Reliability Engineer (DevOps) to join our award-winning team. Would you like to work in a true Development + Ops environment? Do you geek out on data – does the thought of working with petabytes of analytics data blow your mind? How about working with 300+ servers daily and economies of scale that are insane? Do you want to play with the Big Dogs?!?! If so, then this may be the opportunity for you. Please note that this is full-time employment, not a contract position.

TCG is a premier web development company with ~15 million customers nationwide. Our websites consistently rank among the 500 highest-traffic sites in the US. We are constantly pioneering new ideas, products and services, and continually looking to develop and deploy innovative strategies and solutions. We are a five-year-old, profitable, stable and growing DevOps-centric company. Check out TCG at

TCG offers a highly competitive salary + bonus package, 100% company-paid Medical/Dental/Vision benefits, UNLIMITED Vacation, paid Sick Leave, paid holidays, and a 401k plan with company match. Company perks include catered lunches, a fully stocked kitchen with free drinks and food, Xbox, foosball table, BBQ grill, surfboards, free-throw basketball games, massages, and onsite fitness classes. Dog-friendly office: bring your dog to work with you! We are highly socially conscious and growing rapidly. Casual dress code (T-shirt and flip-flops are our uniform) with a team of 80+ at our offices in Pacific Beach and Little Italy. Multiple-year winner of San Diego Business Journal Best Workplaces and Union Tribune Top Places to Work in San Diego!

You will already have outstanding experience setting up and automating high availability data clusters and working with virtualization and container technologies. You are an expert in the scripting languages Python and Bash. You relish complex technical challenges and the adrenaline rush when your skills and experience are pushed to the limit. You are passionate about collaborating with Devs to bridge the gap between development and process. You understand how to develop business-centric processes that convert to profitability. You are not satisfied with using only traditional scripting languages and tools, but are always seeking out and learning new technologies.


  • Use a holistic understanding of technical environments and business outcomes to help the company move to faster testing and highly successful product deployments.
  • Ensure that sites and systems run smoothly, optimally, efficiently and reliably at all times.
  • Collaborate with Development teams to create new products and continuously improve existing ones, including planning, testing, staging and deployment.
  • Proactively and regularly communicate with Development teams to ensure that new (or improved) software works efficiently across diverse operating systems and platforms.
  • Manage large systems and maintain a very high-quality end user experience, even while introducing new features.
  • Develop new features, scaling, automation and self-healing processes for sites and systems.
  • Automate configuration management, deployment of product releases and provisioning of servers in development, QA, staging and production environments.
  • Troubleshoot sites and restore them to full performance, as required.
  • Work with our hosting provider to manage/grow hardware on the dedicated provider cloud.
  • Optimize and audit server infrastructure/configuration and develop/implement improvements to network architecture.
  • Support a clustered web environment that receives 500-1000 requests per second.
  • Manage DNS and firewall rules.
  • Manage security of production infrastructure, systems and applications in compliance with PCI requirements.
  • Other duties as required.

Required Skills and Experience

  • 5+ years’ experience in Reliability Engineering and/or DevOps, at least 3 years with high volume websites.
  • Expert in the scripting languages Python and Bash.
  • Strong experience running production workloads in a cloud environment.
  • Strong experience using virtualization software such as VMware, OpenStack, VirtualBox.
  • Solid experience developing, maintaining, and debugging applications built in PHP, Node.js, or Go.
  • Solid experience with automation and configuration management tools such as Salt, Vagrant, Fabric, Jenkins, etc.
  • Solid experience with container technologies such as Kubernetes, Docker, CoreOS, OpenStack, Vagrant, Ansible.
  • Production experience in designing, deploying and administering complex cloud applications (API Gateway, Lambda, ECS, ALB, WAF, EC2, RDS, ElastiCache, Elasticsearch, SQS, IAM, VPC, CloudFormation).
  • Strong experience setting up and automating high availability data clusters (MySQL, PostgreSQL replication, Redis cluster, Elasticsearch clustering, etc.).
  • Experience with:
    • Linux Systems Administration (Ubuntu, Red Hat)
    • Web Serving Software/Architecture (Nginx, PHP-FPM, Redis)
    • Version Control Software (Git, GitHub or Bitbucket)
  • Understanding of business outcomes and how to deploy business-centric processes that convert to profitability.
  • Comfortable with frequent, incremental code testing and deployment.
  • Familiar with debugging tools like Wireshark, traceroute, tcpdump, dnsmasq and strace.
  • Strong ability to collaborate and openly communicate cross-functionally, particularly with development teams.
  • Some experience with information security is helpful.
  • Experience with and working knowledge of monitoring software (Nagios, Graphite) is a plus.
  • Able to work with teams as well as independently with minimal supervision.
  • Exceptional work ethic and a high sense of urgency; driven, self-motivated and highly accountable, with strong initiative and passion.
  • Excited to learn new things and share knowledge and best practices with others.

Note for Principal Agencies - Principal agents should not forward resumes to The Control Group (TCG). TCG will not be responsible for any fees arising from the use of resumes submitted by agencies unless a prior written and signed agreement and an authorized job order for this position are in place.
