AWS cloud best practices for a relief community platform

The Challenge

Innovations made possible by public clouds are commonly associated with consumer-centric products and services that primarily target developed markets. However, their usefulness and viability extend far beyond the most obvious and widely recognized cases, and Talk to Loop proves just that.

As a digital community feedback platform which piloted in Zambia in 2020, Talk to Loop bridges the gap between local communities and international humanitarian and development organizations, transforming their mutual communication with a heavy focus on accountability, transparency and accessibility.

Its mission comes with a unique set of challenges — especially when its success is not measured in dollars, but in meaningful interactions between survivors in need and the people seeking to provide impactful relief.

With official launches in Zambia, Indonesia, the Philippines, Somalia, Poland and Ukraine, the platform already fully supports 15 languages via 6 communication channels (web, SMS, Facebook Messenger, WhatsApp, Telegram and IVR) for more than 750 organizations worldwide.

As those numbers keep increasing, Elite Crew, the developers of the platform, continue to rely on AWS as they have from the very beginning, using its broad catalog of services to provide the necessary infrastructural backbone and to accelerate development with services that closely match the platform’s requirements.

Naturally, increasing numbers translate to increasing scale, not just in terms of traffic but also in terms of complexity. During initial development, prototyping and pilot launches, issues that creep into a project often go unnoticed due to their low impact, and may even be forgotten entirely over time. However, small issues stay small only as long as the scale does. What slipped through the cracks initially can be exacerbated by the scale of success, and can even turn that success back into failure.

We usually measure downtime in monetary losses. However, Talk to Loop uses modern cloud features to help international organizations fight against the digital divide, and often does so in regions affected by disasters and conflicts. The unreliability of the service can undermine trust in not just the platform, but in the relief organizations providing it. Where availability can mean the difference between life and death and survivors fight against the odds to access help, common IT challenges take on a whole new meaning.

To minimize the risk of an IT disaster ever turning into a real-world disaster, Elite Crew asked Chaos Gears for assistance in reviewing the platform, remediating any potential issues with its setup and optimizing its costs, which, given the platform’s characteristics and policies, is equally crucial to ensuring its long-term success.

The Solution

While Talk to Loop was our primary focus from the start, our initial meetings quickly established that a holistic approach to Elite Crew’s infrastructure would be beneficial in the long term. We were tasked with reviewing it with a heavy focus on security, while simultaneously looking for cost optimization and automation opportunities. In particular, a well-structured Operational Readiness Review with clear guidelines on security and administrative requirements was a top priority.

Chaos Gears started the engagement by conducting a thorough architecture audit of our existing systems. This audit was an eye-opener, as it highlighted various inefficiencies and potential security vulnerabilities in our infrastructure. Their team of experts demonstrated an impressive depth of knowledge and attention to detail.

— Marek Wrzosowski, CEO, Elite Crew

For a modern software house like Elite Crew, development efficiency and velocity are paramount. When infrastructure is automated, reliable and scalable, developers can focus on actual application code and innovate with confidence, which directly translates into new on-demand features for platforms like Talk to Loop. This confidence, however, is only possible when everyone is on the same page with regard to fundamental internal processes, and when practice is built on well-established guidelines which cover not just common day-to-day operations, but also critical conditions like disaster recovery.

With our partner’s primary concerns established, a Cloud Architect and an engineer responsible for conducting the AWS Well-Architected Review (WAR) joined forces with Elite Crew on this project.

The AWS Well-Architected Review we were asked to perform provides an industry-proven framework and set of guidelines that, among other things, helps bring all stakeholders to the table and uncovers concrete insights into the status quo along with potential improvements. Put simply, it is designed to help identify and prioritize issues in a way that shortens the road from problems to best-practice solutions.

However, the ability to deliver solutions — once — is just part of the equation.

Continuous best practices

While we’re always available to our clients, past and present, we never assume that we’ll always be the ones working on the infrastructure. In fact, assuming the opposite is itself an industry-wide gold standard.

The long-term efficacy of all solutions is contingent on stakeholders fully understanding why they had been implemented in the first place, and which problems they were meant to solve.

As such, when we work with a client, it is imperative for us to provide a flow of actionable knowledge and data as part of the solutions. Initial meetings and workshops, just as in this case, help us narrow down the problem space and outline the key areas to focus on. Those are often remediations that are relatively small in terms of cost and effort, yet have a high impact further down the road. And more often than not, the fact that they are missing stems from organizational habits and practices.

Thus, what at first glance may just seem like a somewhat lengthy checklist of potential issues to review, in reality — for it to fully deliver on the value it promises — is also a knowledge transfer along with expert guidance on how to best implement proven practices in one’s organization. In short: at Chaos Gears, we strive to provide the fishing rod, not just the fish.

This approach is especially crucial in all matters of security, compliance and optimization. All of these are continuous processes, and none has an all-encompassing one-off solution. They must continue to be properly practiced even when we’re no longer directly involved with a project.

In this case, for example, Elite Crew gained invaluable knowledge and tips on how to best utilize the services it employs on AWS, and not just those used by the Talk to Loop platform — e.g. how to best manage loads or how to coherently and systematically manage security policies.

As our partner manages a whole portfolio of AWS accounts, based on our experience we started by consolidating those assets under the umbrella of a single AWS Organization – primarily in order to improve visibility of costs, security and compliance of the entire collection. We then built upon this by introducing single sign-on authentication into the infrastructure via AWS IAM Identity Center in order to simplify the administrative flow. Our partner would now be able to manage its infrastructural portfolio under one collective configuration and without having to re-authenticate between numerous separate accounts — a convenience to administrators that is also a cost-reduction in practice.

Yet, such consolidation also comes with tradeoffs: accounts that were previously separated by unique credentials could now potentially reach more assets across the entire organization, increasing the attack surface. This is a risk never to be taken lightly, given that compromised credentials remain one of the leading causes of security breaches worldwide.

Luckily, along with the convenience, AWS provides an easy-to-implement mitigation for this issue, which also happens to be an industry-wide best practice: multi-factor authentication. Naturally, we strictly enforced it for all IAM accounts, along with strong password requirements and an automated access key rotation schedule.
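As an illustration only, and not the exact configuration used for Elite Crew, MFA enforcement for IAM identities is commonly expressed as a deny statement conditioned on the `aws:MultiFactorAuthPresent` key. The sketch below builds such a policy document in Python; the helper name, the list of self-service actions and the password policy values are our assumptions for the example.

```python
import json

def build_require_mfa_policy(allowed_without_mfa):
    """Build an IAM policy that denies every action except the listed ones
    whenever the request was not authenticated with MFA."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyAllExceptListedIfNoMFA",
                "Effect": "Deny",
                "NotAction": sorted(allowed_without_mfa),
                "Resource": "*",
                "Condition": {
                    # BoolIfExists also covers requests where the key is absent
                    "BoolIfExists": {"aws:MultiFactorAuthPresent": "false"}
                },
            }
        ],
    }

# Assumed minimum a user needs in order to enrol an MFA device at all
SELF_SERVICE_ACTIONS = {
    "iam:CreateVirtualMFADevice",
    "iam:EnableMFADevice",
    "iam:ListMFADevices",
    "iam:ResyncMFADevice",
    "iam:GetUser",
    "sts:GetSessionToken",
}

# Example values only; the keys follow IAM's account password policy settings
PASSWORD_POLICY = {
    "MinimumPasswordLength": 14,
    "RequireSymbols": True,
    "RequireNumbers": True,
    "RequireUppercaseCharacters": True,
    "RequireLowercaseCharacters": True,
    "MaxPasswordAge": 90,
    "PasswordReusePrevention": 24,
}

policy = build_require_mfa_policy(SELF_SERVICE_ACTIONS)
print(json.dumps(policy, indent=2))
```

The resulting document would typically be attached to a group containing all human users, so that a session without MFA can do little more than register a device.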

We then reviewed the existing IAM accounts, focusing on their permissions and activity patterns, to identify potential security risks. Based on the findings, we pushed for a reduction in both their overall count and their scope of access, in line with the principle of least privilege.
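A review of this kind can be partly automated. The sketch below is a simplified example, not the tooling used in the engagement: it scans a CSV in the shape of an IAM credential report (only a subset of its columns is included here) and flags users without MFA, idle console passwords and overdue access keys. The sample users and the 90-day thresholds are hypothetical.

```python
import csv
import io
from datetime import datetime, timedelta, timezone

PLACEHOLDERS = ("", "N/A", "no_information", "not_supported")

def parse_ts(value):
    """Parse a report timestamp; placeholder values become None."""
    return None if value in PLACEHOLDERS else datetime.fromisoformat(value)

def review_users(report_csv, now, max_idle_days=90, max_key_age_days=90):
    """Return {user: [issues]} for users that deserve a closer look."""
    findings = {}
    for row in csv.DictReader(io.StringIO(report_csv)):
        issues = []
        has_password = row["password_enabled"] == "true"
        if has_password and row["mfa_active"] != "true":
            issues.append("console access without MFA")
        last_used = parse_ts(row["password_last_used"])
        if has_password and (
            last_used is None or now - last_used > timedelta(days=max_idle_days)
        ):
            issues.append("console password idle")
        rotated = parse_ts(row["access_key_1_last_rotated"])
        if (
            row["access_key_1_active"] == "true"
            and rotated is not None
            and now - rotated > timedelta(days=max_key_age_days)
        ):
            issues.append("access key 1 overdue for rotation")
        if issues:
            findings[row["user"]] = issues
    return findings

# Hypothetical sample data with a subset of the credential report columns
SAMPLE_REPORT = """\
user,password_enabled,password_last_used,mfa_active,access_key_1_active,access_key_1_last_rotated
alice,true,2024-05-28T09:00:00+00:00,true,false,N/A
bob,true,2023-11-02T14:30:00+00:00,false,true,2023-01-15T08:00:00+00:00
ci-deployer,false,N/A,false,true,2024-05-01T12:00:00+00:00
"""

now = datetime(2024, 6, 1, tzinfo=timezone.utc)
findings = review_users(SAMPLE_REPORT, now)
for user, issues in findings.items():
    print(user, "->", "; ".join(issues))
```

Flagged users are candidates for a conversation rather than automatic removal, since an idle account may still have a legitimate owner.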

AWS KMS made it possible for us to seamlessly enforce encryption at rest for the platform’s primary data stored in Amazon RDS relational database instances. To secure backups and other historical data, we applied the same approach to its Amazon S3 configuration.
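To make the idea concrete, here is a minimal sketch of the settings involved, assuming a customer-managed KMS key. The key ARN is a placeholder and the helper functions are ours; the dictionaries mirror the structure of the S3 default-encryption configuration and the encryption-related RDS instance parameters.

```python
import json

# Placeholder ARN, used purely for illustration
KMS_KEY_ARN = (
    "arn:aws:kms:eu-central-1:111122223333:"
    "key/00000000-0000-0000-0000-000000000000"
)

def s3_default_encryption(kms_key_arn):
    """Default-encryption configuration for a bucket using SSE-KMS."""
    return {
        "Rules": [
            {
                "ApplyServerSideEncryptionByDefault": {
                    "SSEAlgorithm": "aws:kms",
                    "KMSMasterKeyID": kms_key_arn,
                },
                # Bucket keys reduce the number of requests made to KMS
                "BucketKeyEnabled": True,
            }
        ]
    }

def rds_encryption_params(kms_key_arn):
    """Encryption-related parameters for creating an RDS instance.
    Storage encryption is chosen at creation time; an existing unencrypted
    instance is usually migrated through an encrypted snapshot copy."""
    return {"StorageEncrypted": True, "KmsKeyId": kms_key_arn}

s3_config = s3_default_encryption(KMS_KEY_ARN)
rds_params = rds_encryption_params(KMS_KEY_ARN)
print(json.dumps({"s3": s3_config, "rds": rds_params}, indent=2))
```

With boto3, `s3_config` would be passed as `ServerSideEncryptionConfiguration` to `put_bucket_encryption`, and `rds_params` merged into the arguments of `create_db_instance`.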

The final piece of the fundamental security puzzle was ensuring that sensitive access to the infrastructure, along with Talk to Loop’s private networking, is both properly secured and seamlessly integrated. Due to how the project is structured and organized, we reached slightly beyond the AWS catalog in this case, and opted to use the private networking integrations offered by Cloudflare.

All of these fundamentals combined, along with many changes not described here, amounted to a shift in governance practices that has a direct impact on the security of the platform as a whole.

Observe, learn, remediate

Once everyone is on board and the benefits of such a shift become tangible, the logical next step is to ensure those practices keep being followed in the future. This is where good governance combines with automated observability and monitoring to ensure compliance.

We started by setting up automated notifications about specific internal events, such as the use of root accounts. Such use may (or may not) indicate a violation of the principle of least privilege, but it cannot easily be remediated automatically, as it is sometimes truly necessary.
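One common way to detect root activity, shown here as a sketch rather than the setup actually deployed, is an Amazon EventBridge rule that matches CloudTrail-delivered events whose identity type is `Root`. The pattern below follows the EventBridge event pattern format; the `matches` function is a deliberately tiny stand-in for EventBridge's own matching, written only so the pattern can be tried against sample events.

```python
ROOT_ACTIVITY_PATTERN = {
    "detail-type": [
        "AWS API Call via CloudTrail",
        "AWS Console Sign In via CloudTrail",
    ],
    "detail": {"userIdentity": {"type": ["Root"]}},
}

def matches(pattern, event):
    """Small subset of EventBridge matching: nested dicts recurse,
    lists mean 'the event value must equal one of these'."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True

# Hypothetical, heavily trimmed sample events
root_login = {
    "detail-type": "AWS Console Sign In via CloudTrail",
    "detail": {"userIdentity": {"type": "Root"}, "eventName": "ConsoleLogin"},
}
iam_call = {
    "detail-type": "AWS API Call via CloudTrail",
    "detail": {"userIdentity": {"type": "IAMUser"}, "eventName": "ListBuckets"},
}

print(matches(ROOT_ACTIVITY_PATTERN, root_login))
print(matches(ROOT_ACTIVITY_PATTERN, iam_call))
```

A rule with this pattern would typically forward matches to a notification target such as an SNS topic, leaving the decision on whether the root usage was justified to a human.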

In any case, such events must be observable from an organization’s point of view. Even where automated remediation is not possible, or not desired to begin with, proper auditing and reporting help ensure that best practices in other areas, such as security, are actually adhered to.

In the long term, they also help uncover deviations from guidelines. Such deviations may form notable patterns that point to an issue which would not be apparent from a single event, since isolated events easily fall through the cracks.

While we have so far focused on internal governance, since proper organization of infrastructure is fundamental to all improvements, those very same principles apply to the actual applications and services that run on that infrastructure.

In an ideal world, we want to prevent issues before they even crop up. As far as cloud best practices go, teaming up with an expert partner like Chaos Gears from the very beginning is already a step in that direction.

However, as the world isn’t ideal, certain issues show up only under very specific conditions, while others can take years to surface. And let’s not forget about the infamous Heisenbugs.

Where prevention is not possible, we rely on remediation. And just as an AWS Well-Architected Review serves to remediate issues on an organizational and infrastructural level at a certain point in time, it also serves to help an organization establish the proper means by which issues can be quickly and efficiently discovered and remediated at any point in time — if and when they surface.

Application logging and monitoring tend to appear trivial to laymen, but are complex problems in reality. Knowing what to log, when to log it, and where to send it makes the difference between terabytes of noise and terabytes of actionable data. This is especially true in distributed, microservice-based systems running in environments such as the cloud, where those problems can be further compounded by unreliable tracing.

While AWS, as a platform, cannot magically fix a lack of observability within an application, it does an excellent job of capturing events outside of the application, within its infrastructure. Services like Amazon CloudWatch and AWS CloudTrail integrate with the entire platform and make observability on this level a breeze, while also making it possible for applications to integrate directly with those services to store, filter and visualize operational data.

In this case, proper configuration of those services performed by our experts ensured that Elite Crew now has constant, on-demand awareness of the who, what and when, including automated workflows for certain events and alarms about operational anomalies.
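To show what the who, what and when look like in practice, the sketch below reduces CloudTrail records to those three fields. It assumes the standard CloudTrail log file layout (a top-level `Records` list with `userIdentity`, `eventSource`, `eventName` and `eventTime`); the sample records, account ID and user names are made up for the example.

```python
import json

def summarize(record):
    """Reduce one CloudTrail record to who / what / when."""
    identity = record.get("userIdentity", {})
    who = identity.get("arn") or identity.get("type", "unknown")
    what = f'{record["eventSource"]}:{record["eventName"]}'
    return {"who": who, "what": what, "when": record["eventTime"]}

# Hypothetical, heavily trimmed CloudTrail log file content
SAMPLE_LOG = json.dumps({
    "Records": [
        {
            "eventTime": "2024-06-01T10:15:00Z",
            "eventSource": "s3.amazonaws.com",
            "eventName": "PutBucketPolicy",
            "userIdentity": {
                "type": "IAMUser",
                "arn": "arn:aws:iam::111122223333:user/alice",
            },
        },
        {
            "eventTime": "2024-06-01T10:20:00Z",
            "eventSource": "signin.amazonaws.com",
            "eventName": "ConsoleLogin",
            "userIdentity": {"type": "Root"},
        },
    ]
})

summaries = [summarize(r) for r in json.loads(SAMPLE_LOG)["Records"]]
for entry in summaries:
    print(entry["when"], entry["who"], entry["what"])
```

In a real setup this kind of filtering is usually delegated to CloudWatch Logs metric filters or similar managed tooling, with alarms defined on the resulting metrics.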

The Outcome

Many relatively small remediations and tasks shaped by our initial meetings and the subsequent Well-Architected Review ultimately formed a complete picture, as we — together with our partners at Elite Crew — met all the goals we were tasked to achieve.

During our cooperation, we helped Elite Crew incorporate many security and governance best practices into their day-to-day work, putting an emphasis on data-driven flows based on proper monitoring and reporting patterns. Ultimately, whether it is disaster relief in the physical world or in digital infrastructure, the efficacy of remediation actions is contingent on proper know-how and the right information, at the right time.

Furthermore, a push for the documentation of internal processes resulted in the automation of many of them, e.g. via runbooks and playbooks, while security hardening in several areas was performed using modern solutions that are also convenient in practice. Inconvenience is an often overlooked cost of security solutions: prohibitively high walls prompt our natural, human inclination to avoid the hassle altogether. However, when implemented properly, security doesn’t have to come at a productivity cost.

Most importantly, we provided Elite Crew with the necessary tools and resources to tackle potential future issues, especially with regard to ongoing security best practices and threats, without our direct involvement. The positive outcomes of our cooperation spill over into all projects in Elite Crew’s portfolio.

For Talk to Loop, for instance, it means that the platform can continue to push the boundaries of how international humanitarian and development organizations interact with local communities, increasing their ability to provide help where it’s needed the most in direct coordination with the people they serve. This is in part thanks to the scalable, highly available, reliable and durable infrastructure the platform is running on, and the automated procedures and operational guidelines we helped put in place.

Chaos Gears is an absolute expert in cloud infrastructure development and architecture audit. Working with them has been an exceptional experience from start to finish, and I cannot emphasize enough the value they brought to our Talk to Loop project and Elite Crew organization.

What sets Chaos Gears apart is their unwavering dedication to client satisfaction. They maintained open lines of communication throughout the project, ensuring that we were well-informed at every stage. Their team was responsive to our questions and concerns and worked tirelessly to meet our project milestones on time and within budget.

Working with this organization undoubtedly provides two major added values. The first is the unquestionable level of technical expertise, while the second, which I must emphasize, is Chaos Gears' client-focused educational approach to its partners. Thanks to this approach, the level of awareness of the services offered by AWS has increased significantly in our team.

I look forward to partnering with them on future endeavors.

— Marek Wrzosowski, CEO, Elite Crew
