In my last post, I touched on the increasing cost of storage as the total amount of data we store has grown explosively.
One way to offset this cost is to monetize data by extracting insights that increase revenue and profitability. Historically, the approach to extracting insights was to build data warehouses: house large volumes of data from a variety of sources, scrub it for veracity, and mine it for insights that lead to revenue opportunities.
By definition, data warehouses catered mostly to structured data that was defined, collected and stored under structured governance. Designed to meet specific goals, data warehouses could not adapt quickly when those goals changed. In addition, traditional methods of data warehouse creation can handle neither unstructured data nor the velocity at which data is created today. (By some accounts, 80 percent of data today is unstructured or semi-structured.)
Given the speed at which data is created and consumed today, we need more nimble solutions like data lakes. A data lake is a repository that allows structured, semi-structured and unstructured data to be stored at any scale.
At first glance, data lakes might look like the holy grail for all requirements related to analyzing data, mining for insights, and feeding hungry data scientists trying to train AI/ML models. On the other hand, that very flexibility to store structured, semi-structured and unstructured data means data lakes can quickly become dumping grounds for all data.
While AWS, Snowflake and other similar companies would love it if their customers loaded all their data into these platforms, it's like Hotel California: you can check in, but you can never leave.
Data lakes still need to be planned carefully with defined storage structures to meet performance requirements. What data formats should you use: CSV to make it universally accessible and forever readable, or Parquet to support storage compression and faster querying for analytics? How about ORC to support better compression and predicate pushdown? Or AVRO? Or...?
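One axis of the format tradeoff above is compression: columnar formats like Parquet and ORC encode and compress column values together, while plain CSV stores everything as uncompressed text. As a minimal stdlib-only stand-in (no Parquet library involved, and the sample table is hypothetical), even generic gzip over a repetitive CSV shows how much storage the text format leaves on the table — columnar encodings exploit this same redundancy further.

```python
import csv
import gzip
import io

# Hypothetical sample table standing in for a data-lake dataset:
# a low-cardinality region column and a small numeric column.
rows = [{"region": "us-east", "sales": i % 100} for i in range(10_000)]

def to_csv_bytes(records):
    """Serialize records as plain CSV and return the raw bytes."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["region", "sales"])
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue().encode("utf-8")

plain = to_csv_bytes(rows)
compressed = gzip.compress(plain)

ratio = len(compressed) / len(plain)
print(f"plain CSV: {len(plain)} bytes, gzipped: {len(compressed)} bytes "
      f"(ratio {ratio:.2f})")
```

Repetitive tabular data like this typically compresses to a small fraction of its CSV size, which is one reason analytics-oriented formats bake compression in rather than leaving it to the reader.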
You must also take into consideration the data set sizes to optimize partitions, choose the attributes with the most optimal cardinality for partitioning, and so on.
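The cardinality point can be made concrete with a small stdlib-only sketch (the event records and column names are hypothetical): partitioning on a low-cardinality attribute yields a few useful partitions that queries can prune, while a high-cardinality key explodes into many tiny partitions where metadata overhead dominates.

```python
from collections import defaultdict

# Hypothetical event records; "region" is low-cardinality (a reasonable
# partition key), while "user_id" is high-cardinality (a poor one).
events = [
    {"user_id": i, "region": ["us", "eu", "apac"][i % 3], "amount": i}
    for i in range(9)
]

def partition_by(records, key):
    """Group records into partitions keyed by one attribute,
    mimicking directory-style partitioning (e.g. region=us/...)."""
    parts = defaultdict(list)
    for record in records:
        parts[record[key]].append(record)
    return dict(parts)

by_region = partition_by(events, "region")
by_user = partition_by(events, "user_id")

print(len(by_region))  # 3 partitions: a region filter scans 1/3 of the data
print(len(by_user))    # 9 single-row partitions: pruning gains are swamped
```

A query filtering on `region` only touches the matching partition; the same partitioning by `user_id` buys nothing at this scale and multiplies the number of files a real lake would have to track.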
All this leads to a couple of questions: Will data lakes pay off in the long run? Will they solve all the problems they were designed for and cater to new goals as they emerge?
Data federation or virtualization could be an option, but it has its own challenges: data access controls are likely siloed, security requirements must be managed across systems, source systems might not retain historical data, and query performance varies between the various sources.
While companies have many options for storing their data, their very specific requirements will dictate the best solution.
None of these solutions seem ideal for supporting exploratory analytics efforts. Often, we just need access to data from various sources for short periods of time to prove or disprove hypotheses and extract insights that could support short- or long-term planning efforts.
For those reasons, what we need is a purpose-defined "mini" data lake, or "data pond," that is:
- Short-lived, i.e. created only for as long as it is needed
- Populated with subsets of data collected directly from the source(s)
- Instantiated as a queryable data store
- Optimized for a specific purpose
- Destroyed after the goals have been met
At the least, we avoid creating one more long-lived copy of the data. At best, we achieve our stated goals.
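The full data-pond lifecycle can be sketched in a few lines, using Python's built-in sqlite3 in-memory mode as a stand-in for whatever queryable store you would actually instantiate (the table, columns, and data here are hypothetical): create the store, load a subset from the source, query it for the purpose at hand, then destroy it.

```python
import sqlite3

# Instantiate the pond for only as long as it is needed:
# an in-memory database leaves nothing behind when closed.
pond = sqlite3.connect(":memory:")
pond.execute("CREATE TABLE orders (region TEXT, amount REAL)")

# Load a subset of data collected directly from the source(s)
# for this one analysis.
subset = [("us", 120.0), ("us", 80.0), ("eu", 200.0)]
pond.executemany("INSERT INTO orders VALUES (?, ?)", subset)

# Query the store for its specific purpose.
total_us = pond.execute(
    "SELECT SUM(amount) FROM orders WHERE region = ?", ("us",)
).fetchone()[0]
print(total_us)  # 200.0

# Destroy the pond once the goal is met: no extra long-lived copy remains.
pond.close()
```

The design choice that matters is the last line: because the store is ephemeral by construction, there is no lingering copy of the data to govern, secure, or pay for.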