What’s A Data Platform Anyway?

And Why Your Startup Needs Or Uses One

Photo by [hoch3media](https://unsplash.com/@hoch3media?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText) on [Unsplash](https://unsplash.com/s/photos/platform?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText)

By 2014, Uber had generated only a few terabytes of data. This limited data was spread across a few traditional databases like MySQL and PostgreSQL. Engineers could access the databases individually and write scripts to combine the data sources.

Services interacting with one or more databases | Image by author

Because the data was read directly from the databases, access was very fast, often with sub-minute latency.

But there was a problem: the data was scattered across several databases. In many cases, different services interacted with different databases.

Why was this a problem?

Uber, as we know, is a data-driven company at its roots: predicting demand during high-traffic events and adjusting pricing accordingly, for example, is one of its features. As Uber expanded, the data it produced grew enormously.

But since the data existed in disconnected, decentralized silos, it became challenging for analysts to execute tasks that demanded data from multiple sources. This was just one of the many quandaries they faced while working with disjointed data.

Thus the need to access and analyze all the data in one place became a goal of prime importance. What followed was a series of Data Platform generations that made Uber uber.

uber (/ˈuːbə/): denoting an outstanding or supreme example of a particular kind of person or thing.
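To make that pain concrete, here is a minimal sketch of the kind of glue script an analyst ends up writing when two sources cannot be queried together. Two in-memory SQLite databases stand in for the separate production databases, and the `trips` and `riders` tables, their columns, and the sample values are all hypothetical.

```python
import sqlite3

# Two independent in-memory databases stand in for separate silos.
trips_db = sqlite3.connect(":memory:")
riders_db = sqlite3.connect(":memory:")

trips_db.execute("CREATE TABLE trips (trip_id INTEGER, rider_id INTEGER, fare REAL)")
trips_db.executemany(
    "INSERT INTO trips VALUES (?, ?, ?)",
    [(1, 10, 12.5), (2, 11, 8.0), (3, 10, 20.0)],
)

riders_db.execute("CREATE TABLE riders (rider_id INTEGER, city TEXT)")
riders_db.executemany(
    "INSERT INTO riders VALUES (?, ?)",
    [(10, "Pune"), (11, "Delhi")],
)


def fare_by_city(trips_conn, riders_conn):
    """Join the two sources in application code, since no single SQL query spans both."""
    city_of = dict(riders_conn.execute("SELECT rider_id, city FROM riders"))
    totals = {}
    for rider_id, fare in trips_conn.execute("SELECT rider_id, fare FROM trips"):
        city = city_of.get(rider_id, "unknown")
        totals[city] = totals.get(city, 0.0) + fare
    return totals


totals = fare_by_city(trips_db, riders_db)
print(totals)
```

Every new question that spans silos needs another script like this one, and each script has to know where every piece of data lives.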

As a company matures and the data it produces grows, having a consolidated data repository becomes increasingly critical for analytics, business intelligence, and AI systems. Uber could have continued with disjoint silos, but it would never have matured into what it is today. Unifying its different sources helped it unlock the value of its data, and what made this unification possible was a Data Platform.

A Data Platform is a technology that allows data to be collected, transformed, unified, and delivered to users and applications, or used for other business purposes such as running recommendation engines.

In more formal terms, it enables data access, governance, delivery, and security. The need for a Data Platform (DP) arises from data inconsistencies and redundancies within an organization. A DP resolves these and allows us to orchestrate, transform, and serve data to end users (for example, data scientists).

For example, consider an organization that has multiple apps with separate databases. There might be copies of the same data across these databases. This redundancy in turn makes the system harder to scale, which is a big no-no in the tech world!

How can a company focus on evolving when its engineers are busy managing complex interactions with databases instead of crafting new products?

Now that we have an idea of the data platform, let’s look at the general top-level data architecture of the DP used by streaming services.

Data Architecture of a Streaming Data Platform | Image by Author

*Bifrost, used by Disney+ Hotstar (an OTT streaming service), is one example of such a Data Platform.*

Instead of having multiple repositories for data generated by different clients and their different use cases, Bifrost has a single entry point. All the clients and all the microservices have a single Kafka-based ingestion point.

The data is first aggregated, unified, and standardized. Business logic is then applied, and the result is sent to a Data Lake or a Data Warehouse.
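The idea of a single entry point that standardizes everything can be sketched in a few lines. This is not Bifrost's actual code: a plain Python list stands in for the Kafka topic, and the field names (`uid`, `userId`, `ts`, and so on) are hypothetical examples of how different clients might label the same information.

```python
from datetime import datetime, timezone


def first_present(raw, *keys):
    """Return the value of the first key that exists in raw, else None."""
    for key in keys:
        if key in raw:
            return raw[key]
    return None


def standardize(raw):
    """Map a client-specific payload onto one shared event schema."""
    user = first_present(raw, "user_id", "uid", "userId")
    ts = first_present(raw, "ts", "timestamp")
    if user is None or ts is None:
        raise ValueError(f"event is missing a user or timestamp: {raw!r}")
    return {
        "user_id": str(user),
        "event": str(first_present(raw, "event", "type") or "unknown").lower(),
        # Normalize epoch seconds (number or string) to an ISO 8601 UTC string.
        "ts": datetime.fromtimestamp(float(ts), tz=timezone.utc).isoformat(),
        "source": raw.get("source", "unknown"),
    }


# A plain list stands in for the single ingestion topic (an assumption for this sketch).
ingestion_topic = []


def ingest(raw):
    """Single entry point: every client and microservice calls this one function."""
    ingestion_topic.append(standardize(raw))


ingest({"uid": 7, "type": "PLAY", "ts": 1600000000, "source": "android"})
ingest({"userId": "42", "event": "pause", "timestamp": "1600000060", "source": "web"})
```

Because every producer goes through `ingest`, downstream consumers only ever see one schema, no matter which client sent the event.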

  • A Data Warehouse (DW) pulls in structured data and runs on a relational database. Example service: Snowflake.

  • A Data Lake (DL) pulls in and stores data in its original format with minimal unification or transformation. Historical data is stored in the DL. Example service: Amazon S3.

The data in the DL and DW is then used as input for different products such as recommendation and ad targeting systems. The structured data in the DW is used mainly by analysts to perform BI tasks.
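The difference between the two stores is easier to see side by side. In the rough sketch below, a list of raw JSON strings plays the role of the lake and an in-memory SQLite table plays the role of the warehouse. Neither is how Snowflake or Amazon S3 is actually used, and the event fields and the `plays` table are hypothetical.

```python
import json
import sqlite3

events = [
    {"user_id": "u1", "event": "play", "title": "Show A", "seconds": 90},
    {"user_id": "u2", "event": "play", "title": "Show B", "seconds": 60},
    {"user_id": "u1", "event": "play", "title": "Show B", "seconds": 60, "device": "tv"},
]

# Lake: keep every record verbatim, including fields nobody has modeled yet.
lake = [json.dumps(event) for event in events]

# Warehouse: a fixed relational schema that analysts can query with SQL.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE plays (user_id TEXT, title TEXT, seconds INTEGER)")
for line in lake:
    record = json.loads(line)
    if record.get("event") == "play":
        warehouse.execute(
            "INSERT INTO plays VALUES (?, ?, ?)",
            (record["user_id"], record["title"], record["seconds"]),
        )

# A typical BI-style question: total watch time per user.
watch_time = dict(
    warehouse.execute("SELECT user_id, SUM(seconds) FROM plays GROUP BY user_id")
)
print(watch_time)
```

Note that the extra `device` field survives in the lake but never reaches the warehouse table, which is the trade-off between keeping data in its original format and keeping it structured.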

The DPs discussed above are examples of Modern Data Platforms (MDPs). A few other kinds of platforms are:

  • Enterprise Data Platform: The predecessor of the MDP, dealing with basic storage and serving.

  • Customer Data Platform: Deals mainly with customer data and is used to build user profiles for future recommendations.

  • Big Data Analytics Platform: A specialized data platform for data analytics purposes.

  • Cloud Data Platform: Built entirely with cloud computing technologies and data stores.

As you might have noticed, there is no sharp line between these DPs. More often than not, you’ll find enterprises using a combination of these basic DPs to create a DP of their own.

If you are curious as to what technologies are used in each stage of a Data Platform, the diagram below lists the common tech stack for each stage.

[Elements of Data Platform](https://towardsdatascience.com/the-building-blocks-of-a-modern-data-platform-92e46061165) | Source [Atlan](https://atlan.com/)

To sum up, a Data Platform is a combination of interoperable, scalable, and replaceable technologies working together to meet an enterprise’s overall data needs.

What makes a good Data Platform?

Now that we have some understanding of what a DP is and why it is used, let’s wrap up this article by looking at what makes a DP a good DP.

  • Availability: The data platform should be highly available for clients and end users such as data analysts and scientists. This becomes even more important in real-time applications.

  • Governance: Wherever data is involved, an enterprise has strict policies to follow. A good data platform must make it easy to apply and update data governance policies.

  • Security: Who has access to the data, and how many data access points are exposed, should be easily configurable in the Data Platform. Many services today use Single Sign-On (SSO) authentication to provide a single point of access for everyone authorized to use the data.

  • Centralization: A good data platform must support all types of sources like MySQL, Cassandra, MongoDB, etc., and help in bridging silos.

  • Delivery: It should support functions like scheduling and proactive alerts.
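As a rough illustration of the governance and security points above, access rules can live in one configurable place instead of being scattered across services. The sketch below is only an assumption about how such a policy might look: the role names, column names, and the `read_rows` helper are all hypothetical.

```python
# Central, easily updated policy: which columns each role may read (hypothetical roles).
POLICY = {
    "analyst": {"city", "fare"},
    "data_scientist": {"city", "fare", "rider_id"},
}


def read_rows(role, rows):
    """Return rows restricted to the columns the given role is allowed to see."""
    allowed = POLICY.get(role)
    if allowed is None:
        raise PermissionError(f"role {role!r} has no access to this dataset")
    return [{k: v for k, v in row.items() if k in allowed} for row in rows]


rows = [
    {"rider_id": 10, "city": "Pune", "fare": 12.5, "phone": "555-0100"},
    {"rider_id": 11, "city": "Delhi", "fare": 8.0, "phone": "555-0101"},
]

analyst_view = read_rows("analyst", rows)
scientist_view = read_rows("data_scientist", rows)
```

Updating a governance rule then means changing one entry in `POLICY` rather than touching every service that reads the data.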

Bonus

We started out with how pre-2014 Uber had silos that made scaling difficult. It took three major iterations for Uber to reach its current state-of-the-art Data Platform. In the process, it gave birth to Hudi, an open-source Spark library that handles Uber’s data with low latency and high query speeds. For curious minds, the data architecture of Uber’s current Data Platform can be explored here.

This article is a compilation of all the major points from my recent exploration of Data Platforms. If you enjoyed it and want to know more about Data Platforms, do continue with the further readings below.

Thanks for reading, and do leave your thoughts below.

Further Readings

What is a Data Platform? Definition & Benefits (looker.com)

What is a data platform? | Emerging Technologies | Splunk

Uber’s Big Data Platform: 100+ Petabytes with Minute Latency | Uber Engineering Blog (Must read!)

Data Democratisation @ Hotstar. Introduction | by Jayesh Sidhwani | Disney+ Hotstar

What is a Data Platform? — YouTube