SpotMe Studio architecture

Introduction

SpotMe Studio streamlines the process of planning and executing exceptional virtual events, eliminating the need to juggle numerous tools and services. It operates entirely within the browser, offering a comprehensive suite of features tailored for online events. From video conferencing to on-demand recordings, analytics, applause, Q&A sessions, polls, and more, SpotMe Studio provides everything you need in one convenient platform.

The infrastructure supporting SpotMe Studio is built on foundational components and is managed by the SpotMe engineering team. This architecture gives SpotMe full control over streaming formats and bitrates, and makes it possible to implement advanced features such as live interactivity synchronization, delivery of closed captions, and other capabilities that require low-level access to the stack.

Importantly, this approach ensures that all data points collected during events remain under SpotMe's control, without relying on third parties. By using low-level core components and non-aggregated services, SpotMe maintains full ownership and control over event data, reinforcing data privacy and security for both SpotMe and its clients.

Overall studio architecture

All underlying Studio components are automatically scaled and geographically distributed, and publish advanced telemetry data. This setup not only enables full observability, but also enhances reliability through automated recovery processes. The system is designed to handle failures autonomously, ensuring continuous operation, with a prompt and adequate response to load surges.

WebRTC Infrastructure

Purpose

Our WebRTC infrastructure serves as the backbone for connecting remote speakers during live virtual events, offering a seamless experience without the need for participants to install third-party software on their desktops. Leveraging WebRTC's native integration in most modern browsers, our platform provides a comprehensive browser-based event production capability.

This enables all event management tools to be readily accessible within a single web application page. From moderating and displaying participant questions, launching polls, and seamlessly transitioning between slides, to integrating pre/mid/post-roll videos, managing lower thirds, and monitoring participant engagement levels, our WebRTC-powered solution empowers event organizers with unparalleled control and flexibility.

Multi-layered security

Security within the WebRTC framework is implemented across various tiers. These protective measures encompass the restriction of endpoint access to sessions, adoption of a role-based security structure, and safeguarding the fundamental voice and video data traversing the cloud infrastructure and interconnecting endpoints.

The WebRTC framework relies entirely on established, open standards crafted by industry veterans and leveraged in commercial applications for an extended period. The fundamental protocols underpinning the WebRTC infrastructure security comprise SRTP for media traffic encryption and DTLS-SRTP for key negotiation, both standardized by the IETF.

Encryption

Within the WebRTC infrastructure, endpoints compatible with WebRTC use the AES cipher with 128-bit keys for the encryption of audio and video data. Additionally, HMAC-SHA1 is employed for data integrity verification.
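
The integrity check described above can be illustrated with a minimal sketch. It assumes the widely used SRTP arrangement in which the HMAC-SHA1 output is computed over the packet followed by a 32-bit rollover counter and truncated to 80 bits; the tag length, the function names, and the all-zero demo key are assumptions made for illustration and are not confirmed details of the SpotMe setup.

```python
import hashlib
import hmac
import struct

TAG_LENGTH = 10  # assumed 80-bit tag; the source does not state the tag length


def srtp_auth_tag(auth_key: bytes, packet: bytes, roc: int) -> bytes:
    """Compute HMAC-SHA1 over the packet plus the 32-bit rollover counter."""
    mac = hmac.new(auth_key, packet + struct.pack("!I", roc), hashlib.sha1)
    return mac.digest()[:TAG_LENGTH]


def verify_packet(auth_key: bytes, packet: bytes, roc: int, tag: bytes) -> bool:
    """Return True only if the tag matches; uses a constant-time comparison."""
    return hmac.compare_digest(srtp_auth_tag(auth_key, packet, roc), tag)


if __name__ == "__main__":
    demo_key = bytes(20)  # hypothetical key, for demonstration only
    demo_packet = b"header-and-encrypted-payload"
    demo_tag = srtp_auth_tag(demo_key, demo_packet, roc=0)
    print(verify_packet(demo_key, demo_packet, 0, demo_tag))
```

A receiver that recomputes the tag and finds a mismatch discards the packet, which is how tampering in transit is detected.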

In peer-to-peer connections, including those relayed through cloud-based TURN servers (needed, for example, when a symmetric NAT or a complex corporate firewall sits between the endpoints), endpoints within the WebRTC infrastructure start sessions with randomly generated keys. These keys are periodically rotated and renegotiated throughout the conversation, which further strengthens security. The keys are ephemeral: they are valid only for a brief duration and are not retained or stored anywhere within the system.

Therefore, the VP8- or VP9-encoded A/V bitstream in the SRTP payload is never decrypted and is transferred encrypted end-to-end. The SFU (selective forwarding unit) servers and TURN servers simply relay content. Only the headers of the RTP packets are left unencrypted, which is what allows users to benefit from simulcast. The SFU uses information in the RTP headers (such as payload type, SSRC, sequence numbers, and timestamp) to differentiate between the quality layers of a video stream, perform latency management, and decide which streams to forward to each participant based on their network conditions, device capabilities, or user preferences. This lets every participant receive the best stream their conditions allow.
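
As a rough illustration of how a forwarding unit can work from headers alone, the sketch below parses the fixed 12-byte RTP header and picks a simulcast layer for a receiver. The header layout is the standard RTP one; the `pick_layer` policy, the SSRC values, and the per-layer bitrates are hypothetical and are not taken from the SpotMe implementation.

```python
import struct
from dataclasses import dataclass
from typing import Dict


@dataclass
class RtpHeader:
    version: int
    payload_type: int
    sequence_number: int
    timestamp: int
    ssrc: int


def parse_rtp_header(packet: bytes) -> RtpHeader:
    """Read the fixed RTP header fields; the payload is never touched."""
    if len(packet) < 12:
        raise ValueError("an RTP packet needs at least a 12-byte header")
    first, second, seq, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    return RtpHeader(
        version=first >> 6,            # top two bits of the first byte
        payload_type=second & 0x7F,    # low seven bits; the top bit is the marker
        sequence_number=seq,
        timestamp=timestamp,
        ssrc=ssrc,
    )


def pick_layer(layers_kbps: Dict[int, int], available_kbps: int) -> int:
    """Return the SSRC of the highest-bitrate layer that fits the receiver.

    Falls back to the lowest layer when nothing fits (hypothetical policy).
    """
    fitting = [ssrc for ssrc, kbps in layers_kbps.items() if kbps <= available_kbps]
    if fitting:
        return max(fitting, key=lambda ssrc: layers_kbps[ssrc])
    return min(layers_kbps, key=lambda ssrc: layers_kbps[ssrc])
```

With three hypothetical layers at 150, 500, and 1500 kbps, a receiver estimated at 800 kbps would be forwarded the 500 kbps layer, and packets whose SSRC belongs to another layer would simply not be relayed to that receiver.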

Cloud compositing service

The cloud compositing service is responsible for blending and rendering the final video stream of speakers connected via WebRTC. This involves integrating each speaker's audio and video stream (including any screen share stream if enabled), the background image, and slides previously uploaded by speakers, which may be displayed intermittently under the control of the speakers. Additionally, any configured lower thirds are incorporated into the stream as required.

After composition, the resulting video frame, along with its associated audio, is encoded. The video is encoded to H.264 and the audio to AAC. Both streams are then packaged in an FLV container for transport over RTMPS and sent to the SpotMe adaptive streaming encoding and packaging infrastructure for further processing and distribution. The cloud compositing service scales automatically with demand, including when a sudden load surge is detected.

Adaptive streaming (HLS) encoding and packaging

Transcoding and packaging

When the RTMPS live feed enters the video streaming pipeline, it is adapted and transformed to ensure compatibility with a wide array of end-user devices. To achieve this broad compatibility, SpotMe relies on HTTP for delivering audio and video content to its users, using HLS, the adaptive streaming technology pioneered by Apple, to cater to diverse device types.

Following the processing and decoding of RTMP packets, the subsequent step involves demultiplexing the container (FLV) to separate audio and video streams for encoding. From the initial audio and video bitrates, the transcoder dynamically generates multiple alternative bitrates/resolutions. Currently, our service offers the following variant bitrates/resolutions:

640 x 360 at 1.0 Mbps
768 x 432 at 1.5 Mbps
960 x 540 at 1.8 Mbps
1280 x 720 at 2.3 Mbps
1920 x 1080 at 4.0 Mbps

This adaptive streaming approach ensures optimal viewing experiences across a wide range of devices and network conditions, providing users with seamless access to high-quality audio and video content.

Packaging

After transcoding, the alternative bitrates produced by the transcoder are encapsulated into an MPEG-2 TS container and split into smaller video and audio segments, each lasting 6 seconds. These segments are then organized into variant playlists, with each playlist covering a depth of 30 seconds, that is, 5 MPEG-2 TS segments. The variant playlists are in turn referenced from a master playlist, providing a hierarchical structure for efficient content delivery.
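
The playlist hierarchy and the segment arithmetic can be sketched as follows. The bitrate ladder, the 6-second segments, and the 30-second depth come from this article; the variant playlist URIs and the function names are hypothetical, and a production master playlist would normally carry further attributes (such as codecs) that are omitted here.

```python
SEGMENT_SECONDS = 6
PLAYLIST_DEPTH_SECONDS = 30

# (width, height, Mbps) as listed in this article
LADDER = [
    (640, 360, 1.0),
    (768, 432, 1.5),
    (960, 540, 1.8),
    (1280, 720, 2.3),
    (1920, 1080, 4.0),
]


def segments_per_playlist() -> int:
    """Number of segments each variant playlist references."""
    return PLAYLIST_DEPTH_SECONDS // SEGMENT_SECONDS


def master_playlist(ladder) -> str:
    """Build a minimal HLS master playlist with one entry per variant."""
    lines = ["#EXTM3U"]
    for width, height, mbps in ladder:
        bandwidth = int(round(mbps * 1_000_000))  # HLS expects bits per second
        lines.append(
            f"#EXT-X-STREAM-INF:BANDWIDTH={bandwidth},RESOLUTION={width}x{height}"
        )
        lines.append(f"{height}p/playlist.m3u8")  # hypothetical variant URI
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(segments_per_playlist())
    print(master_playlist(LADDER))
```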

The end-user player retrieves the master playlist, allowing it to access the available variant bitrates and dynamically select the most suitable one based on network conditions and device capabilities. This adaptive selection process ensures seamless playback, preventing buffering interruptions. Adaptive streaming algorithms employed by players vary across platforms, with download speed and buffer fill rate serving as primary determinants for bitrate selection.
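
Because selection algorithms differ between players, the following is only one possible heuristic, written to show how download speed and buffer level can drive the choice. The variant bitrates are the ones listed in this article; the safety factor, the low-buffer threshold, and the function name are assumptions.

```python
# Variant bitrates from this article, in bits per second
VARIANTS_BPS = [1_000_000, 1_500_000, 1_800_000, 2_300_000, 4_000_000]


def choose_variant(
    throughput_bps: float,
    buffer_seconds: float,
    safety: float = 0.8,              # assumed headroom below measured throughput
    low_buffer_seconds: float = 6.0,  # assumed threshold: one segment of buffer
) -> int:
    """Pick the highest variant that fits the bandwidth budget.

    The budget is halved while the buffer is nearly empty, so the player
    refills it quickly before stepping quality back up.
    """
    budget = throughput_bps * safety
    if buffer_seconds < low_buffer_seconds:
        budget *= 0.5
    candidates = [bps for bps in VARIANTS_BPS if bps <= budget]
    return max(candidates) if candidates else min(VARIANTS_BPS)
```

For example, a player measuring 3 Mbps with a healthy buffer would settle on the 2.3 Mbps variant under these assumptions, while the same throughput with a nearly empty buffer would drop to 1.0 Mbps.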

During periods of network congestion or reduced bandwidth, the video quality presented to the user adjusts accordingly. If network bandwidth falls below a critical threshold, video playback may be temporarily suspended to prevent buffering issues and ensure a smooth viewing experience.

Live stream distribution - CDN

Once HLS playlists and corresponding segments for various alternative bitrates are generated, these assets are transmitted by the packager to a CDN (Content Delivery Network) origin. From this origin, the segments and playlists are propagated initially to regional edge caches, and then distributed across a network of edge servers (POPs) strategically positioned across different geographical locations. The primary objective is to minimize the distance between end users and the content, thereby reducing download latency and enhancing streaming performance.

When a player initiates a request for a resource, the CDN dynamically determines the optimal edge server in terms of latency and proximity to the end user. This decision is typically based on factors such as the geographic location of the end user, proximity to nearby edge servers, and the current load and traffic conditions of these edge servers. By selecting the closest and most suitable edge server, the CDN ensures efficient content delivery, contributing to a smoother streaming experience for end users.
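
The trade-off between proximity and load can be pictured with a small scoring sketch. Real CDNs make this decision inside their own routing layer (typically through DNS or anycast), so the `Edge` type, the load cutoff, the penalty weight, and the server names below are purely illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Edge:
    name: str
    latency_ms: float  # measured or estimated round trip to the end user
    load: float        # 0.0 (idle) to 1.0 (saturated)


def pick_edge(
    edges: Iterable[Edge],
    max_load: float = 0.9,           # assumed cutoff for an overloaded server
    load_penalty_ms: float = 100.0,  # assumed weight of load against latency
) -> Edge:
    """Prefer nearby servers, but steer away from heavily loaded ones."""
    all_edges = list(edges)
    usable = [e for e in all_edges if e.load < max_load] or all_edges
    return min(usable, key=lambda e: e.latency_ms + e.load * load_penalty_ms)


if __name__ == "__main__":
    candidates = [
        Edge("edge-a", latency_ms=12, load=0.95),
        Edge("edge-b", latency_ms=20, load=0.30),
        Edge("edge-c", latency_ms=95, load=0.10),
    ]
    print(pick_edge(candidates).name)
```

In this example the closest server is skipped because it is overloaded, and the next-closest one is chosen instead.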

Audio and video consumption

As previously mentioned, when an end user wants to stream a specific live audio-video feed, the master playlist is downloaded first to retrieve the available variant bitrates and associated segments. However, before accessing any of these resources, SpotMe's end-user player must acquire a signed cookie that grants authorized access to CDN resources. This mechanism ensures that only authenticated users can access the intended resources, protecting the content from unauthorized access.
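
The general idea behind signed-cookie access can be sketched as below. CDN vendors define their own cookie formats and signing algorithms, and this article does not say which one SpotMe uses, so the HMAC-SHA256 scheme, the cookie layout, and every name in this example are assumptions made only to show the principle.

```python
import hashlib
import hmac
import time
from typing import Optional


def sign_cookie(secret: bytes, path_prefix: str, expires_at: int) -> str:
    """Issue a cookie value granting access to one path prefix until expiry."""
    message = f"{path_prefix}|{expires_at}".encode()
    signature = hmac.new(secret, message, hashlib.sha256).hexdigest()
    return f"{path_prefix}|{expires_at}|{signature}"


def is_authorized(
    secret: bytes, cookie: str, requested_path: str, now: Optional[int] = None
) -> bool:
    """Check signature, expiry, and path scope before serving a resource."""
    now = int(time.time()) if now is None else now
    try:
        path_prefix, expires_text, signature = cookie.rsplit("|", 2)
        expires_at = int(expires_text)
    except ValueError:
        return False  # malformed cookie
    message = f"{path_prefix}|{expires_at}".encode()
    expected = hmac.new(secret, message, hashlib.sha256).hexdigest()
    return (
        hmac.compare_digest(signature.encode(), expected.encode())
        and now < expires_at
        and requested_path.startswith(path_prefix)
    )
```

Under this scheme the edge server can validate every playlist and segment request without contacting the application backend, because the cookie carries its own scope, expiry, and proof of origin.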

Furthermore, it's worth noting that both playlists and associated segments are delivered over HTTPS for added security. This, coupled with the requirement for a signed cookie to access CDN resources and the utilization of RTMPS for input into the transcoding pipeline, guarantees end-to-end security from content input to delivery. By employing these stringent security measures, SpotMe prioritizes user privacy and safeguards against unauthorized access or tampering of live audio-video streams.
