Notes on Informatica Big Data Management with Intelligent Data Lake Deep Dive and Demo

Informatica Big Data Management (BDM)

  • Secure & Govern
      ◦ Connect, Ingest, Parse, Extract, Profile, Process, Perform
  • Polyglot Computing
      ◦ Informatica Native, Database Pushdown, Hadoop Pushdown, MapReduce, Tez, Blaze

New trends in big data management:

  • Collect Everything
  • New User Class
      ◦ Data Scientists
  • Flexible Access
      ◦ Batch, Real-Time, Interactive, On-Premise, Cloud

Challenges Faced by Organizations

  • IT
      ◦ Can’t cope with growing business demands
      ◦ No visibility into what the business is doing with the data
      ◦ Losing the ability to govern and manage data as an asset
  • Data Analysts
      ◦ Frustrated by slow response from IT due to long backlog
      ◦ Can’t easily find trusted data
      ◦ Limited access to the data
      ◦ No way to collaborate, share, and update curated datasets

“If you don’t understand the data assets, how do you use them?”

“It’s frustrating to see a really promising dataset only to find out it’s really bad.”

“I need metadata for those data assets, to provide visibility into the data lineage.”

“Too many data silos make it impossible to know what data can be trusted.”

“Prepping & cleaning the data takes us 2-3 weeks, sometimes longer.”

Informatica’s “Intelligent Data Lake”

Informatica’s “Smart Executor” determines the best execution engine at runtime (Hive, Spark, etc.)

The chosen engine then submits the job to the cluster’s resource manager (YARN), which schedules it from there.
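
As a rough analogy (not Informatica’s internals): once an engine such as Spark is chosen, it compiles the work and hands resource scheduling to YARN. A minimal PySpark sketch, with a made-up table and query:

    # Rough analogy only: the chosen engine (Spark here) plans the SQL and
    # submits jobs to YARN's ResourceManager, which allocates containers.
    # The table and query are hypothetical.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("engine-to-yarn-sketch")
        .master("yarn")          # hand resource scheduling to YARN
        .enableHiveSupport()     # resolve tables via the Hive metastore
        .getOrCreate()
    )

    # Spark runs the query's stages inside YARN containers.
    spark.sql("SELECT region, SUM(amount) AS total FROM sales GROUP BY region").show()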


SQL to Informatica mapping functionalities

  • Complex SQL with multiple JOINs can be converted to mappings (a mapping is Informatica’s ETL dataflow construct)
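
For illustration only (this is not Informatica’s converter): the multi-JOIN query in the comments below, re-expressed as a chain of discrete dataflow steps, which is roughly the shape a mapping takes. All table and column names are hypothetical:

    # Illustration only, not Informatica's SQL-to-mapping converter.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    orders    = spark.table("orders")     # source
    customers = spark.table("customers")  # source
    products  = spark.table("products")   # source

    # SELECT c.name, p.sku, o.amount
    # FROM orders o
    # JOIN customers c ON o.cust_id = c.id
    # JOIN products  p ON o.prod_id = p.id
    result = (
        orders
        .join(customers, orders["cust_id"] == customers["id"])         # Joiner step 1
        .join(products,  orders["prod_id"] == products["id"])          # Joiner step 2
        .select(customers["name"], products["sku"], orders["amount"])  # projection
    )
    result.show()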

Intelligent Data Platform – Metadata

Enterprise Information Catalog module

  • Analyze and understand large volumes of data in an enterprise environment
  • Information about data domains, data relationships, etc.
  • “Unified view into enterprise information assets”
  • Business-user oriented solution
  • Semantic search with dynamic facets
  • Data lineage
  • Change impact (a toy sketch of both follows this list)
  • Relationships discovery
  • High level data profiling
  • Data domains
  • Custom attributes with business classifications
  • Broad metadata source connectivity
  • Big data scale
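
One way to picture data lineage and change impact (a toy model, not how the Enterprise Information Catalog implements it): lineage is a directed graph from source assets to derived assets, and change impact is just downstream reachability:

    # Toy model of lineage/change impact; asset names are made up.
    from collections import defaultdict, deque

    edges = defaultdict(list)  # directed: source asset -> derived assets
    for src, dst in [("raw.orders", "staging.orders"),
                     ("staging.orders", "curated.sales"),
                     ("curated.sales", "dashboard.revenue")]:
        edges[src].append(dst)

    def change_impact(asset):
        """Everything downstream of `asset`, i.e. affected if it changes."""
        seen, queue = set(), deque([asset])
        while queue:
            for nxt in edges[queue.popleft()]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen

    print(change_impact("raw.orders"))
    # {'staging.orders', 'curated.sales', 'dashboard.revenue'}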

Intelligent Data Lake Demo

  • Centralized repo of large volumes of structured and unstructured data (a Hadoop cluster)
  • Analysts can use a data lake for self-service to discover and prepare data assets with little or no training
  • Data lake terminology
      ◦ Data Asset – data you work with as a unit
      ◦ Project – a container that stores assets and worksheets
      ◦ Recipe – the steps taken to prepare data in a worksheet
      ◦ Data prep – the process of combining, cleansing, transforming, and structuring data (see the sketch after this list)
      ◦ Data publication – making the prepared data available in the data lake
  • Data discovery & analysis process (for IT)
      ◦ Analyst searches for assets with “Live Data Map”
      ◦ Recommendation engine provides suggestions
      ◦ Data Prep application sets up data sets
      ◦ These “recipes” can be operationalized/automated
      ◦ Publish!
  • Uploading data
      ◦ From local drive
  • “Trending Searches”, Recently Viewed, Asset Types
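
A recipe’s steps (combine, cleanse, transform, structure, publish) map naturally onto a small dataflow. A hedged PySpark sketch of that shape, with made-up table and column names; this is not the Intelligent Data Lake’s actual recipe format:

    # Sketch of a prep "recipe" as explicit dataflow steps.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    customers = spark.table("raw.customers")
    orders    = spark.table("raw.orders")

    prepared = (
        customers
        .join(orders, "customer_id")                              # combine
        .filter(F.col("email").isNotNull())                       # cleanse
        .withColumn("order_month",
                    F.date_format("order_date", "yyyy-MM"))       # transform
        .select("customer_id", "email", "order_month", "amount")  # structure
    )

    # "Publication": write the prepared data back to the lake.
    prepared.write.mode("overwrite").saveAsTable("curated.customer_orders")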

“Lineage” will tell you where it all comes from.

“Relationships” tells you which other tables share a join ID.

“Recommendations” – Based on data assets added to the project

Blend data by combining worksheets

Add formulas, create new data sets, and add them to the project. All of the steps are visible and saved in the recipe.

  • From here, this mapping can be scheduled to run automatically

Real-time integration with Informatica Cloud REST API (with Salesforce)

Before

  • A data sync task runs on a schedule and pulls changes into the target systems

After

  • Instead of relying on manual/scheduled execution, the sync can fire based on an event
  • Create an Apex trigger.
  • Create an Apex class that the trigger invokes; it calls the Informatica Cloud REST API (see the sketch below)
  • Real-time update!
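
In Apex this is typically an HttpRequest callout made from an @future(callout=true) method that the trigger invokes. The same two REST calls, sketched here in Python; the endpoints, payload fields, and region URL are assumptions to verify against your org’s Informatica Cloud REST API documentation:

    # Sketch of the callout the Apex class would make. Endpoints, payload
    # fields, and the region URL are assumptions; verify against your
    # org's Informatica Cloud REST API docs.
    import requests

    LOGIN_URL = "https://dm-us.informaticacloud.com/ma/api/v2/user/login"  # assumed

    # 1. Log in to obtain a session id and the org-specific server URL.
    login = requests.post(LOGIN_URL, json={
        "@type": "login",
        "username": "user@example.com",  # hypothetical credentials
        "password": "secret",
    }).json()

    # 2. Start the data synchronization task by id ("DSS" task type assumed).
    requests.post(f"{login['serverUrl']}/api/v2/job",
                  headers={"icSessionId": login["icSessionId"]},
                  json={"@type": "job",
                        "taskId": "0001ABCD0000000001",  # hypothetical task id
                        "taskType": "DSS"})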