Modernizing push notification API for Teams

The Push Notification Hub (PNH) service recently went through significant modernization. We migrated from legacy components such as .NET Framework 4.7.2 and a custom HTTP server called "RestServer" to .NET 8 and ASP.NET Core 8. For handling outgoing requests, we moved from a custom HTTP client/handler called "HttpPooler" to Polly v8 and SocketsHttpHandler. This article describes the journey thus far and its impact on PNH performance.

What is PNH and how does it affect Teams users?

First, we should start with a description of what PNH is and what role it plays in Microsoft's real-time communication infrastructure. PNH is a central and critical component in the distribution of all kinds of event notifications to end users, and it currently handles traffic for Teams, Skype and a couple of other applications from Microsoft's portfolio. These notifications can be delivered via actual push channels like FCM (Google's Firebase Cloud Messaging, for relaying messages to Android users) or APNS (Apple Push Notification Service, for relaying messages to iOS users), but first and foremost as real-time notifications through Microsoft's internal WebSocket channel. The real-time channel is used when the app is being presented to the user, and the background push channel is used to reach the application when it is in the background, for example on mobile devices.

All communications, including messages, calls and meetings, are ultimately directed to PNH in the form of events. PNH then loads the list of registered devices for the target user, the notification configuration and other metadata. Using this information (transformed input message payload, channel configuration, etc.), the requests are constructed and dispatched to the designated notification channels. The send-message flow can be illustrated by a high-level diagram (figure omitted). In reality, the PNH part can be broken down into several subservices and flows; for example, just the mechanism of discovering the list of devices that need to be contacted is a whole separate service.

So, PNH serves as a conduit for any kind of push event to be delivered to chat room members, be it text messages in chats, channels and meetings, or typing/calling notifications. Yes, even a calling notification can be thought of as a special message (it's essentially a signal for the Teams mobile app to display the incoming call screen and start ringing your phone). The nature of these messages places some unique requirements on PNH. There are two main scenarios: calling messages want the least amount of latency (i.e., the target device should start ringing as soon as possible), while text messaging tends to prioritize throughput over latency. To give you a general idea about the volume of traffic: on a typical day, PNH makes HTTP requests counting in the hundreds of billions!

Therefore, the health and performance of PNH are important factors in the overall application experience. They have a direct impact on how quickly and reliably users receive notifications. High resource consumption for PNH means more requests end up in the queue, resulting in more timeouts and missed delivery deadlines. The service is also costlier to run, which negatively impacts scalability.

Our expectations

PNH was slated for migration (along with other services) to drive down operational costs and also to bring in the latest tech and security improvements introduced in .NET Core. It was running on a legacy stack, built around the .NET Framework 4.7.2 ecosystem.
Because of this foundation, many other libraries were out of date, missing performance and, perhaps more importantly, security improvements. Based on observations of other services, we expected at least a 25% improvement in "Q-factor" after migrating to the .NET Core stack.

What is Q-factor?

Q-factor is the metric by which we measure performance evolution. The general formula:

Q-factor = (Requests served) / (CPU consumption)

For PNH, the formula computes the "work done" per "resources spent", so it increases if the service handles more requests with the same CPU consumption, or if the same volume of requests is handled more efficiently. This value can then be used to judge relative improvements (or degradations) in performance after a feature is rolled out.

Migration phases

Our journey to .NET Core can be broken down into several distinct phases. Let's go over each phase in more detail.

Start of the migration

We used RestServer as the HTTP server/listener component for incoming requests. To make outgoing HTTP requests we used HttpPooler, an old component built with some basic resiliency capabilities, most of which carried over to its spiritual successor, internally named R9, which was built on top of Polly v7. Parts of R9 were later integrated into Polly v8. HttpPooler used the stock HttpClientHandler to transfer requests over the wire.

Neither RestServer nor HttpPooler had an easy replacement on .NET Core. That would normally have forced us to branch our code heavily and use different components on each platform. Given how deeply these elements were integrated into the service, we ultimately decided against that. Instead, we opted to first migrate off RestServer and HttpPooler to ASP.NET Core 2.2 and to R9 coupled with WinHttpHandler, respectively. The reason is that these new components can be used as-is on both .NET Framework and .NET Core; although the .NET Standard 2.0 packages have been released, they are quite outdated nowadays. Completing this preparation step gave us a much more comfortable foundation on which to continue the .NET Core migration, and it also gave us confidence during rollouts that both versions of our service would behave the same with regard to business logic (i.e., fewer regressions and rollbacks).

Initial phase

So the first step towards .NET Core was to get rid of RestServer (handling incoming requests) and HttpPooler (handling outgoing requests and resiliency). We went through the code (~190k lines according to VS Code Metrics) and made a ton of necessary changes, as these components were embedded quite deep.

RestServer replacement

RestServer was replaced with ASP.NET Core 2.2, the latest version to support both .NET Framework and .NET Core (later versions dropped .NET Framework support). As the server implementation we used the widely adopted HttpSys, despite wishing for Kestrel instead. This was because HttpSys resembled RestServer's behavior a bit more closely, and also because, at the time, Kestrel had a serious performance degradation issue on our hosting platform due to a negative interaction with Windows Defender.

HttpPooler replacement

HttpPooler was replaced with a combination of R9 (handling HTTP resiliency such as retries, rate limits, etc.) and WinHttpHandler (handling the over-the-wire HTTP communication).
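To give a rough picture of what this intermediate, cross-platform setup can look like, here is a minimal sketch using ASP.NET Core 2.2-era APIs: HTTP.sys for incoming traffic and a WinHttpHandler-backed HttpClient for outgoing traffic (assuming the Microsoft.AspNetCore.Server.HttpSys and System.Net.Http.WinHttpHandler packages). The URL prefix, timeout and placeholder endpoint are made-up values, and R9's own resilience wiring around the client is omitted; this is not PNH's actual code.

```csharp
using System;
using System.Net.Http;
using Microsoft.AspNetCore;
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Hosting;
using Microsoft.AspNetCore.Http;

public class Program
{
    // Outgoing client backed by WinHttpHandler (standing in for the retired HttpPooler).
    private static readonly HttpClient Outgoing = new HttpClient(new WinHttpHandler
    {
        ReceiveDataTimeout = TimeSpan.FromSeconds(30)   // illustrative timeout only
    });

    public static void Main(string[] args) =>
        WebHost.CreateDefaultBuilder(args)
            // HTTP.sys as the incoming server (works on both .NET Framework and .NET Core on Windows).
            .UseHttpSys(options => options.UrlPrefixes.Add("http://+:5000/"))   // hypothetical prefix
            .Configure(app => app.Run(async context =>
            {
                // Placeholder handler; the real service routes the event and fans it out
                // to the push channels via the outgoing client.
                await context.Response.WriteAsync($"client timeout: {Outgoing.Timeout}");
            }))
            .Build()
            .Run();
}
```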
Note: R9 (Rejuvenate) is a .NET SDK designed to provide a strong foundation upon which high-performance and high-availability services can be built. R9 strives to insulate services from the nitty-gritty details of the platform they are executing on, and includes a growing set of utility features which have proven valuable to service developers. Its HTTP resiliency components are based on Polly v7.

It's worth mentioning that our expectations regarding performance increases were initially conservative; after all, we were replacing already fine-tuned components. Indeed, moving to ASP.NET Core 2.2 and WinHttpHandler did not come with any unexpected performance fluctuations. However, we were later pleasantly surprised after replacing WinHttpHandler with .NET Core's own SocketsHttpHandler (more on that later in this article).

A lot of these changes took the whole rollout roundtrip:

1. Merge the new code.
2. Deploy it to live servers.
3. Gradually enable the feature following safe rollout practices (which takes time).
4. Verify it works and is stable under normal conditions; iron out any bugs.
5. Clean up the old code.

Transition to .NET Core runtime

During this phase we slowly transitioned from .NET Framework (aka NetFx) to the .NET Core runtime. The first step was to introduce multitargeting to our projects so that the entire business logic could be compiled under both .NET Framework 4.7.2 and .NET 8. Next, we added the actual .NET 8 implementation of our service and a switching mechanism (based on deployment variables) that allowed us to move any of our servers back and forth. Then, after the world-wide rollout, we did the legacy code cleanup.

Runtime switch Q-factor impact

Significant performance gains were expected, as the runtime and BCL in .NET 8 have been improved over .NET Framework 4.7.2 in many ways. This was later confirmed: the switch had a profound, positive effect on Q-factor, with a big and clear jump up (the good direction). We observed this on a graph covering three peak traffic periods, with an obvious Framework <-> Core transition point in the middle, measured on one of our production deployments.

Note: Feel free to ignore the unit of the Y axis; it is not important. What matters is the relative change between .NET Framework and .NET Core.

Let's take a look at Q-factor numbers over a typical business weekend:

Platform              | Mean Q-factor | Improvement
.NET Framework 4.7.2  | 1573          | -
.NET 8                | 2331          | 48%

This means that just by switching runtimes, our Q-factor went up 48% (2331 / 1573 ≈ 1.48). Not too bad, considering it's basically the same business logic code (save for a couple of minor compatibility fixes and if-defs).

On the latest tech

Being done with the .NET 8 rollout and having finally removed the legacy components from the code enabled us to continue switching to some of the new technologies at our disposal.

ASP.NET Core 8 and Minimal APIs

ASP.NET Core 8 brings tons of features, fixes, and improvements, and opens the door to further optimizations in the future, such as Native AOT. Minimal APIs are a new way to define endpoints and routing in an ASP.NET Core application, designed to be high-performance and simpler to use than their predecessor, the classic MVC (model-view-controller) pattern. Minimal APIs allowed us to completely remove controllers and controller-based routing in PNH; our endpoint definition code was slashed from several code files down to around one page of text.

Another benefit that we immediately took advantage of is the ability to define multiple incoming request pipelines based on the URL of the request. This allowed us to restrict the full pipeline to actual business endpoints, while technical endpoints (like health checks) get a simplified pipeline (faster, with fewer memory allocations).
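To make the idea concrete, here is a minimal sketch of this pattern on a stock ASP.NET Core 8 Minimal API host. The routes, the placeholder middleware and the NotificationEvent payload type are hypothetical; this is not PNH's real endpoint contract.

```csharp
var builder = WebApplication.CreateBuilder(args);
var app = builder.Build();

// Attach the "full" pipeline (auth, rate limiting, rich telemetry, ...) only to business paths.
app.UseWhen(
    ctx => ctx.Request.Path.StartsWithSegments("/api"),
    branch => branch.Use(async (ctx, next) =>
    {
        // Placeholder for the heavier middleware that only business traffic should pay for.
        await next(ctx);
    }));

// Technical endpoint: matched by routing, but skips the /api-only middleware above.
app.MapGet("/healthz", () => Results.Text("OK"));

// Business endpoint; route shape and payload are illustrative assumptions.
app.MapPost("/api/notifications/{userId}", (string userId, NotificationEvent evt) =>
    Results.Accepted($"/api/notifications/{userId}"));

// Requests that match no endpoint fall through and get a cheap 404 by default.
app.Run();

// Hypothetical payload type, for illustration only.
public record NotificationEvent(string Channel, string Message);
```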
More to the point, requests with invalid paths or verbs (think possible attacks or automated vulnerability scanners) hit a virtually empty pipeline and are cheaply discarded.

SocketsHttpHandler

Let's talk about improvements to our outgoing HTTP pipeline. The HTTP handler is the software component at the end of the pipeline (after telemetry, resilience and other components) that is responsible for actually sending requests over the wire to the intended target. We used WinHttpHandler as the go-to handler for outgoing traffic after HttpPooler, and it served its purpose well, being multi-platform and mature tech. Once on .NET 8, however, we could finally switch to SocketsHttpHandler, the recommended handler for the .NET Core platform, built from the ground up with a strong focus on performance and reliability. This brought improved performance, reliability and security, as well as compatibility (supporting the newest HTTP standards like HTTP/3). We also discovered and reported a new bug in WinHttpHandler regarding client certificate corruption under very heavy load; SocketsHttpHandler does not have this issue, as it handles certificates carefully behind the scenes.

The effects of introducing SocketsHttpHandler were quite significant. Both Q-factor and latency (to be more precise, the 99th percentile latency of all successful calls) improved greatly across the board. Let's go over some highlights.

SHH Q-factor impact

Important: these improvements are measured on top of the .NET 8 runtime switch (which brought its own set of performance benefits).

Q-factor impact over a typical week while doing the gradual rollout:

HTTP handler        | Mean Q-factor | Improvement
WinHttpHandler      | 2264          | -
SocketsHttpHandler  | 2744          | 21%

SocketsHttpHandler is a clear winner here in terms of raw performance. Together with the runtime switch, it increased the Q-factor of PNH by around 70%!

APNS latency

Let us demonstrate the impact of SocketsHttpHandler on one of our more network-heavy integrations, the Apple Push Notification Service (in effect whenever you message or call an iOS device). APNS uses HTTP/2-based binary communication, which in itself poses unique challenges as well as optimization opportunities for the HTTP handler of your choice. After switching to SocketsHttpHandler, latency for successful requests dropped almost by half, which shows that the SocketsHttpHandler stack is much better optimized!

HTTP handler        | Mean P99 latency | Improvement
WinHttpHandler      | 99.8 ms          | -
SocketsHttpHandler  | 61.1 ms          | 39%

Realtime notifications latency

Realtime notifications are especially important: they are latency-sensitive (for example, calling) and they constitute most of our world-wide traffic (even more than APNS). We are happy to report that this traffic experienced a significant latency reduction in the range of hundreds of milliseconds. To put it into numbers:

HTTP handler        | Mean P99 latency | Improvement
WinHttpHandler      | 506.4 ms         | -
SocketsHttpHandler  | 329.2 ms         | 35%

This is a big improvement in push notification latency that could be felt world-wide!

Polly v8

Note: Polly is a .NET resilience and transient-fault-handling library that allows developers to express policies such as Retry, Circuit Breaker, Timeout, Bulkhead Isolation, and Fallback in a fluent and thread-safe manner.

.NET 8 enabled us to use the latest version of R9 at first, and then later its de facto successor, Polly v8. This let us retire even more legacy code while improving the overall reliability and security of PNH. Polly also had a profound, positive impact on memory allocations.
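To give a flavor of what this combination can look like, here is a minimal sketch of an outgoing HTTP stack that pairs SocketsHttpHandler with a Polly v8 resilience pipeline (assuming the Polly v8 packages). The connection settings, retry policy and target URL are illustrative assumptions, not PNH's production configuration.

```csharp
using System.Net;
using Polly;
using Polly.Retry;

// SocketsHttpHandler handles the wire; values below are illustrative, not tuned for production.
var handler = new SocketsHttpHandler
{
    PooledConnectionLifetime = TimeSpan.FromMinutes(5),   // recycle connections to pick up DNS changes
    PooledConnectionIdleTimeout = TimeSpan.FromMinutes(1),
    EnableMultipleHttp2Connections = true,                // useful for HTTP/2 targets such as APNS
    AutomaticDecompression = DecompressionMethods.All
};
var client = new HttpClient(handler);

// Polly v8 resilience pipeline: retry transient failures, then cap the overall attempt time.
var pipeline = new ResiliencePipelineBuilder<HttpResponseMessage>()
    .AddRetry(new RetryStrategyOptions<HttpResponseMessage>
    {
        MaxRetryAttempts = 3,
        BackoffType = DelayBackoffType.Exponential,
        UseJitter = true,
        ShouldHandle = new PredicateBuilder<HttpResponseMessage>()
            .Handle<HttpRequestException>()
            .HandleResult(r => (int)r.StatusCode >= 500)
    })
    .AddTimeout(TimeSpan.FromSeconds(10))
    .Build();

try
{
    // Send one hypothetical push request through the resilience pipeline.
    HttpResponseMessage response = await pipeline.ExecuteAsync(
        async ct => await client.PostAsync("https://example.test/push", new StringContent("{}"), ct),
        CancellationToken.None);
    Console.WriteLine($"Pushed: {(int)response.StatusCode}");
}
catch (Exception ex)
{
    // Expected here, since the URL above is a placeholder.
    Console.WriteLine($"Push failed after retries: {ex.Message}");
}
```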
Q-factor stayed more or less the same, but the .NET 8 event counters showed a big improvement in average heap size (graph omitted).

Closing thoughts

PNH is deriving great benefits from .NET 8. Overall performance, as evidenced by the Q-factor metric, improved by about 70%. Performance is a major factor for a service like this and reflects positively in basically every flow on the Teams platform that has to do with messaging. The results actually exceeded our expectations by a significant margin. Essentially, PNH is now faster and cheaper: improved latency means everyone can enjoy snappier calling and messaging notifications, and reduced resource consumption means we can afford more servers for improved redundancy (think fewer outages). It can also translate into denser global coverage, further reducing latency wherever the user might be located on the globe. And that is just the first of a series of milestones...

Next steps

Now that PNH is on .NET 8 and the .NET Framework code is all cleaned up (we no longer need to support it), our hands are untied to adopt even more cool technologies. A sneak peek of our future plans:

Migrate to System.Text.Json

PNH currently uses Json.NET (a.k.a. Newtonsoft.Json) for all of its JSON (de)serialization, and there is a lot of that going on behind the scenes, as we rely on JSON for all requests and responses. System.Text.Json proves to be superior in terms of performance. It also has great support for async code flows and has methods optimized for being awaited. This matters for a high-load API like PNH that uses async code heavily, as it helps smooth out thread utilization and avoid thread starvation. (A minimal sketch of this streaming style of (de)serialization appears at the end of this section.)

Utilize Span<T> and Memory<T> tooling

PNH does a lot of string processing and reprocessing: parsing, injecting, removing and transforming data from one form to another. These operations could benefit greatly from tools like spans. Concepts like memory slicing, stack-allocated memory or parsing directly from spans can bring substantial improvements in CPU/memory consumption, further driving down operational cost.

Native AOT

Native Ahead-Of-Time compilation is something we'd like to explore. It has the potential to improve startup and runtime performance. This is possible on .NET 8+, which adds support for native AOT in ASP.NET Core.

Possibility to host on Linux

Bringing PNH to .NET 8 means we are now on an actually cross-platform framework. Concretely, it opens the way to hosting our service on Linux, which has the potential to further improve the performance and overall stability of the service.

Kestrel

We are using HttpSys to listen to and process incoming connections. Kestrel is a lightweight and performance-focused alternative. Advantages of Kestrel include:

- High performance and low overhead. It's optimized for handling a large number of concurrent connections, making it suitable for high-traffic applications.
- Designed to run on multiple platforms, including Windows, Linux, and macOS, which makes it a versatile choice for applications that need to be deployed across different environments.
- Better integration with ASP.NET Core.
- Expanded configuration options.
- Connection middlewares!
- No need for admin rights to listen on port numbers under 1024.
- Actively developed and maintained with the latest security patches and standards.

.NET 9

The newest version of .NET was released during the writing of this article. It offers an optimized runtime and provides opportunities for cost savings through fine-tuning PNH code.
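As promised above, here is a minimal sketch of the streaming style of (de)serialization that System.Text.Json enables. The PushEvent type and sample values are hypothetical, for illustration only; PNH's real payloads and contracts are not shown here.

```csharp
using System.Text.Json;

// Round-trip one event through a stream to show the async (de)serialization pattern.
var options = new JsonSerializerOptions(JsonSerializerDefaults.Web);
var evt = new PushEvent("user-123", "apns", "Hello from PNH");   // made-up sample data

using var buffer = new MemoryStream();
await JsonSerializer.SerializeAsync(buffer, evt, options);        // write directly to a stream
buffer.Position = 0;

// Read directly from a stream (e.g. a request body) without materializing a string first;
// the awaited call keeps threads free under heavy load.
PushEvent? roundTripped = await JsonSerializer.DeserializeAsync<PushEvent>(buffer, options);
Console.WriteLine(roundTripped);

// Hypothetical payload type used only for this sketch.
public record PushEvent(string UserId, string Channel, string Message);
```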
Conclusion

The modernization of PNH has been a significant step forward for our team. By leveraging .NET 8, we've achieved notable improvements in performance, scalability, and efficiency. .NET 8 also brought much-needed security enhancements to the critical components our code uses, along with many exciting new language features that are paving the way towards further performance optimizations and modern C# code practices. These changes directly enhance the experience for Teams users, ensuring faster and more reliable notifications.

As we look ahead, we're excited to explore the possibilities that .NET 9 and other emerging technologies offer. The journey of modernization is ongoing, and we're committed to continuously improving our services. We'd love to hear about your experiences with modernizing your applications or adopting the latest .NET technologies in the comments below.

The post Modernizing push notification API for Teams appeared first on .NET Blog.