Site Reliability Engineer
Company: Han IT Staffing
Posted on: November 25, 2022
Malvern, PA - Remote for now
Please read this: We need people who have strong Java, NodeJS, and
AWS application development skills, are familiar with resiliency
architecture patterns, and have a good understanding of
observability, scalability, performance, and resiliency engineering.
--- Are you an engineer who loves to solve impactful complex
--- Are you passionate about finding opportunities to improve
system performance and efficiency, scalability, fault tolerance,
and self-healing capabilities?
--- Are you excited about Chaos Engineering? Do you want to apply
these principles and creatively experiment with our systems to
uncover hidden weaknesses?
--- Are you obsessed with understanding systems' inner state,
interactions between systems or observability-driven
If the above holds, then the Lead Site Reliability Engineer
opportunity at Vanguard is for you! A successful candidate will
likely have experience as a Full Stack Engineer who has supported
their applications operationally. You will solve reliability
problems across product families and continuously seek
opportunities to improve our systems' "-ilities". You will
also help define, maintain, and carry out subdivisional reliability
engineering standards, contribute to enterprise-wide libraries for
reliability, and train product SRE and product family SRE leads
within the subdivision.
In this role you will:
1. Instrument, enhance and advocate for system observability.
Identify and develop solutions to bridge systems observability
2. Collaborate with internal teams to evaluate the health,
stability, and reliability of systems/platforms. Look for
opportunities to improve system performance efficiency and
3. Develop and communicate new standards and newly available
tools and frameworks across subdivisions. Enforce reliability
standards. Design and develop new automated solutions for
4. Provide technical leadership, consultancy, and coaching on
designing and implementing both traditional and serverless
architectures in AWS with an emphasis on repeatability, scaling
options, resilience, reliability, telemetry, networking, etc.,
including design patterns for resilient systems.
5. Lead failure mode analysis spanning product families when new
features and architecture patterns are introduced. Facilitate
post-incident reviews for any high-severity, client-impacting
events local to the product family.
6. Lead cross-product or cross-subdivision chaos
7. Design, review, and coach others on performance tests using
appropriate components (e.g., requests per minute, number of
threads, the construction of a request with headers and cookies).
8. Consult on, review, and influence architectural decisions,
including non-functional aspects, and coach others on them. Propose
potential technical solutions/enhancements and explain convincingly
which is better and why.
9. Contribute to or lead the Reliability Engineering and Resilience
practice. Remain informed about site reliability engineering
activities happening within the subdivision.
10. Work with product owners to set subdivision goals for higher
availability and SRE impact, and track progress toward achieving
11. Provide technical leadership, guidance, consulting, training,
and governance on SRE to one or more product families in a
12. Identify opportunities to automate away toil and develop
solutions, monitor error budget exhaustion rates, configure
auto-scaling thresholds for the product, and incorporate resilience
patterns, such as circuit breakers, into the application code.
Develop complex deployment and/or routing strategies for high
13. Maintain and look for opportunities to improve the centralized
incident response playbook for the subdivision to document
standards for managing communication and escalation during an
14. Oversee blameless post-incident reviews for high-severity
incidents involving multiple product families.
Core Responsibilities / Qualifications
--- Minimum of eight years of related work experience, with at least
three years of development experience.
--- Undergraduate degree or equivalent combination of training and
experience. Graduate degree preferred.
--- Full stack development - JDK 8+ preferred, with Spring Boot,
REST APIs, multithreaded and multiprocessing applications, and
GraphQL. Experience with UI development (familiarity with Angular,
TypeScript, NodeJS, etc.) is a plus.
--- Ability to diagnose and resolve problems in high-throughput
--- Experience with one or more observability frameworks or tools,
such as OpenTelemetry (Java, JavaScript, etc.), CloudWatch,
Grafana, Splunk, etc.
--- Exposure to *nix environments including some shell script
development and basic command execution.
--- Strong understanding of database principles and working
knowledge of distributed storage and infrastructure solutions.
--- Experience with container management and microservices
architectures, such as Docker, in cloud and on-premises
--- Working knowledge of AWS network foundations, application
networking, edge, and network security.
--- Excellent communication and documentation skills.
Looking for lead- to expert-level software engineers or hands-on