The Control Group

Sr. Site Reliability Engineer (SRE, DevOps)

Reliability Engineering – San Diego, California
Department: Reliability Engineering
Employment Type: Full-Time Active
Minimum Experience: Experienced

The Control Group (TCG) is entering an exciting growth phase in 2019, and we are looking for a talented, collaborative Senior Site Reliability Engineer (DevOps) to join our award-winning team. Would you like to work in a true Development + Ops environment? Do you geek out on data – does the thought of working with petabytes of analytics data blow your mind? How about working with 300+ servers daily and economies of scale that are insane? Do you want to play with the Big Dogs?!?! If so, then read on!

TCG is an award-winning web development company with over 15 million customers nationwide. Our cutting-edge technology keeps people safe, both online and off. Our websites are consistently ranked among the top 500 highest-traffic sites in the US. Our products have been featured on the Discovery Channel, Mashable, Vice, Entrepreneur and Business Insider, and even made a cameo in a Disney animated comedy! A pioneer of new ideas, we’re constantly looking to develop and deploy innovative strategies and solutions. Our people and culture are second to none: we’re innovative, creative, collaborative and talented. We work hard, play hard, and together we work magic!

Join our amazing team in our brand-new state-of-the-art office with stunning views of beautiful downtown San Diego. Our dog-friendly office is packed with snacks and crazy-good perks (like free massages, kombucha on tap, free catered lunches, ping pong, video games, offsite team events and more)! We offer a highly competitive salary + bonus package, 100% company paid health insurance (Medical, Dental, Vision), UNLIMITED vacation, Paid Sick Leave, Paid Holidays, Student Loan Repayment Program, 529 Education Savings Plan, Training/Education Reimbursement, free Gym Membership, Paid Parking and 401k Plan with Company Match. Check us out at

You will already have outstanding experience setting up and automating high availability data clusters and working with virtualization and container technologies. You are an expert in the scripting languages Python and Bash. You relish complex technical challenges and the adrenaline rush when your skills and experience are pushed to the limit. You are passionate about collaborating with Devs to bridge the gap between development and process. You understand how to develop business-centric processes that convert to profitability. You are not satisfied with using only traditional scripting languages and tools, but are always seeking out and learning new technologies.

Responsibilities include (but are not limited to):

  • Manage security of production infrastructure, systems and applications in compliance with PCI requirements
  • Use a holistic understanding of technical environments and business outcomes to help the company move to faster testing and highly successful product deployments
  • Ensure that sites and systems continuously and consistently run smoothly, optimally, efficiently and reliably
  • Collaborate with Development teams to create new products and continuously improve existing ones, including planning, testing, staging and deployment
  • Proactively and regularly communicate with Development teams to ensure that new (or improved) software works efficiently across diverse operating systems and platforms
  • Manage large systems and maintain very high-quality end user experience even while introducing new features
  • Develop new features, scaling, automation and self-healing processes for sites and systems
  • Automate configuration management, deployment of product releases and provisioning of servers in development, QA, staging and production environments
  • Troubleshoot sites and restore them to full performance, as required
  • Work with the hosting provider to manage and grow hardware on the provider's dedicated cloud
  • Optimize and audit server infrastructure/configuration and develop/implement improvements to network architecture
  • Support a clustered web environment that receives 500-1000 requests per second
  • Manage DNS and firewall rules
  • Other duties as required

Requirements:


  • 5+ years’ experience in Reliability Engineering and/or DevOps, including at least 3 years with high-volume websites
  • Must have strong experience with information security and working knowledge of PCI
  • Expert in the scripting languages Python and Bash
  • Experience with and working knowledge of Golang is a plus
  • Strong experience running production workloads in a cloud environment
  • Strong experience using virtualization software such as VMware, OpenStack, VirtualBox
  • Solid experience maintaining and debugging applications built in PHP, Node.JS, or Golang
  • Solid experience with automation and configuration management tools such as Saltstack, Ansible, Vagrant, Jenkins, etc.
  • Solid experience with container technologies such as Kubernetes, Docker, Nomad, OpenStack, Vagrant
  • Production experience in designing, deploying and administering complex cloud applications (Google Cloud Platform, Consul and Terraform)
  • Strong experience setting up and automating high availability data clusters (MySQL, PostgreSQL, Redis, ElasticSearch, etc.)
  • Experience with:
      o   Linux Systems Administration (Ubuntu, Red Hat)
      o   Web Serving Software/Architecture (Nginx, PHP-FPM, Redis)
      o   Version Control Software (Git, GitHub or Bitbucket)
  • Comfortable with frequent, incremental code testing and deployment
  • Familiar with network debugging tools like Wireshark, traceroute, tcpdump and dnsmasq
  • Strong ability to collaborate and openly communicate cross-functionally, particularly with development teams
  • Experience with and working knowledge of monitoring software (Nagios, Graphite)
  • Able to work with teams as well as independently with minimal supervision
  • Exceptional work ethic, high sense of urgency, driven, self-motivated, highly accountable with strong initiative and passion
  • Excited to learn new things and share knowledge and best practices with others

Note for Principal Agencies: Principal agents should not forward resumes to The Control Group (TCG). TCG will not be responsible for any fees arising from the use of resumes submitted by agencies without a prior written and signed agreement and an authorized job order for this position in place.
