1. Relationship Between Data Lakehouse and Data Warehouse
Data Warehouse:
- Traditional structured storage optimized for SQL analytics.
- Uses Extract, Transform, Load (ETL) to process structured data.
- Limited support for unstructured/semi-structured data.
- Examples: Snowflake, Redshift, BigQuery.
Data Lakehouse:
- Combines data lake (flexible storage) + data warehouse (performance & governance).
- Supports structured, semi-structured (JSON, Parquet), and unstructured data (images, logs).
- Uses Delta Lake for ACID transactions, schema enforcement, and time travel.
- Enables BI, ML, and streaming on a single platform.
| Feature | Data Warehouse | Data Lakehouse |
|---|---|---|
| Data Types | Structured only | Structured, semi-structured, unstructured |
| Cost | Expensive storage | Cost-effective (object storage) |
| Performance | Optimized for SQL | Optimized for SQL + ML + Streaming |
| Schema | Schema-on-write | Schema-on-read & schema enforcement |
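The schema-on-write vs. schema-on-read distinction can be sketched in plain Python (a conceptual illustration only — not Databricks or Spark API code; the schema and record layout are invented):

```python
# Conceptual sketch: schema-on-write rejects bad rows at ingest time,
# while schema-on-read stores anything and applies a schema when queried.

SCHEMA = {"id": int, "amount": float}

def write_on_schema(store, row):
    """Warehouse-style: validate against the schema before storing."""
    for col, typ in SCHEMA.items():
        if not isinstance(row.get(col), typ):
            raise ValueError(f"rejected at write: {col} must be {typ.__name__}")
    store.append(row)

def read_on_schema(raw_rows):
    """Lake-style: store raw data, coerce to the schema only when reading."""
    for row in raw_rows:
        try:
            yield {col: typ(row[col]) for col, typ in SCHEMA.items()}
        except (KeyError, TypeError, ValueError):
            pass  # unreadable rows surface only at query time

warehouse = []
write_on_schema(warehouse, {"id": 1, "amount": 9.5})    # accepted
# write_on_schema(warehouse, {"id": "x", "amount": 1})  # would raise at write

lake = [{"id": "2", "amount": "3.5"}, {"id": "bad"}]    # anything lands in the lake
print(list(read_on_schema(lake)))                       # [{'id': 2, 'amount': 3.5}]
```

A lakehouse combines the two: flexible ingestion like the lake, plus optional enforcement (next section) like the warehouse.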
2. Improvement in Data Quality in Lakehouse Over Data Lake
Data Lakes (Traditional Issues):
- No schema enforcement → “Data Swamp” (poor quality).
- No ACID transactions → Dirty reads/writes.
- Limited metadata management → Hard to govern.
Lakehouse Improvements (via Delta Lake):
- ACID Transactions: Ensures consistency (e.g., concurrent writes).
- Schema Enforcement: Prevents bad data ingestion.
- Time Travel: Roll back to previous versions (e.g., `RESTORE TABLE`).
- Data Quality Checks: Expectations (e.g., `CHECK` constraints).
- Unified Governance: Fine-grained access control (row/column-level).
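As a minimal sketch (plain Python, not the Delta Lake API), time travel amounts to keeping every committed version of a table addressable, so a bad write can be undone by re-committing an older snapshot:

```python
# Minimal sketch of Delta-style time travel: each commit appends an
# immutable version, and RESTORE simply re-commits an older snapshot.

class VersionedTable:
    def __init__(self):
        self.versions = []          # version N = self.versions[N]

    def commit(self, rows):
        self.versions.append(list(rows))
        return len(self.versions) - 1   # new version number

    def read(self, version=None):
        """Read the latest version, or 'time travel' to an older one."""
        return self.versions[-1 if version is None else version]

    def restore(self, version):
        """Like RESTORE TABLE ... VERSION AS OF <n> — itself a new commit."""
        return self.commit(self.versions[version])

t = VersionedTable()
t.commit([{"id": 1}])                 # version 0
t.commit([{"id": 1}, {"id": 2}])      # version 1 (a bad write, say)
t.restore(0)                          # version 2 == version 0's data
print(t.read())                       # [{'id': 1}]
print(t.read(version=1))              # bad version is still queryable
```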
3. Silver vs. Gold Tables & Workload Sources
- Medallion Architecture:
- Bronze (Raw): Raw ingested data (e.g., JSON logs, IoT streams).
- Silver (Cleaned): Validated, deduplicated, enriched data.
- Gold (Aggregated): Business-level aggregates (e.g., KPIs, reports).
| Table Type | Use Case | Source | Example |
|---|---|---|---|
| Bronze | Raw ingestion | Kafka, Files | sales_raw |
| Silver | Cleaning & Enrichment | Bronze | sales_cleaned |
| Gold | Reporting & Aggregation | Silver | sales_monthly_metrics |
- Bronze as Source: Stream processing (e.g., Spark Structured Streaming).
- Gold as Source: BI dashboards (Tableau, Power BI).
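The Bronze → Silver → Gold flow can be sketched end-to-end in plain Python (table and column names are invented; in Databricks each step would read and write Delta tables with Spark):

```python
# Medallion sketch: raw ingest -> cleaned/deduplicated -> business aggregates.

bronze_sales_raw = [                    # raw ingest: duplicates and nulls included
    {"order_id": 1, "month": "2024-01", "amount": 100.0},
    {"order_id": 1, "month": "2024-01", "amount": 100.0},   # duplicate
    {"order_id": 2, "month": "2024-01", "amount": None},    # fails validation
    {"order_id": 3, "month": "2024-02", "amount": 50.0},
]

def to_silver(rows):
    """Silver: deduplicate on order_id and drop rows that fail validation."""
    seen, silver = set(), []
    for r in rows:
        if r["amount"] is not None and r["order_id"] not in seen:
            seen.add(r["order_id"])
            silver.append(r)
    return silver

def to_gold(rows):
    """Gold: aggregate cleaned rows into monthly revenue metrics."""
    gold = {}
    for r in rows:
        gold[r["month"]] = gold.get(r["month"], 0.0) + r["amount"]
    return gold

silver = to_silver(bronze_sales_raw)
print(to_gold(silver))    # {'2024-01': 100.0, '2024-02': 50.0}
```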
4. Databricks Platform Architecture (Control vs. Data Plane)
Control Plane:
- Managed by Databricks (in their cloud).
- Includes:
- UI/API management.
- Job scheduler.
- User authentication (via Databricks account).
Data Plane:
- Runs in customer’s cloud account (AWS/Azure/GCP).
- Includes:
- Compute (Clusters, SQL Warehouses).
- Storage (DBFS, External Blob Storage).
- Delta Lake tables.
5. All-Purpose vs. Jobs Clusters
| Feature | All-Purpose Cluster | Jobs Cluster |
|---|---|---|
| Use Case | Interactive (Notebooks) | Scheduled/Automated Jobs |
| Cost | More expensive (always-on) | Cheaper (terminates after job) |
| Access | Multiple users | Single job execution |
| Management | Manual start/stop | Auto-terminates |
- All-Purpose: Data scientists exploring data.
- Jobs Cluster: Nightly ETL pipeline.
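To make the lifecycle difference concrete, here is a sketch of a cluster-create payload for the Clusters API (`POST /api/2.0/clusters/create`); the name, runtime, and node type are illustrative. Note that a jobs cluster is typically declared as a `new_cluster` block inside the job definition instead, and terminates automatically when the job ends:

```python
# Sketch of an all-purpose cluster spec for POST /api/2.0/clusters/create.
# Runtime and node-type values are illustrative and cloud-specific.
import json

cluster_spec = {
    "cluster_name": "interactive-dev",
    "spark_version": "11.3.x-scala2.12",  # Databricks Runtime version string
    "node_type_id": "i3.xlarge",          # AWS example; differs on Azure/GCP
    "num_workers": 2,
    "autotermination_minutes": 30,        # idle timeout to control cost
}

payload = json.dumps(cluster_spec)
print(payload)
```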
6. Cluster Software Versioning (Databricks Runtime)
- Databricks Runtime (DBR): Optimized Spark + extra features (Delta, ML).
- Versions: e.g., `DBR 10.4 LTS`, `DBR 11.3 LTS`.
- LTS: Long-Term Support (stable; recommended for production).
- GPU/ML Runtimes: Pre-installed ML libraries (TensorFlow, PyTorch).
7. Filtering Accessible Clusters
- Methods:
  - UI: Compute → filter to clusters you can attach to ("Can Attach To" permission).
  - API: `GET /api/2.0/clusters/list`, then check each cluster's access level (e.g., via the cluster Permissions API).
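The filtering step can be sketched over a `clusters/list`-shaped response in plain Python. The `can_attach` field below is illustrative — in practice attach rights come from the Permissions API, not from `clusters/list` itself:

```python
# Sketch: filter a clusters/list-style response down to the clusters the
# current user may attach to. The "can_attach" field is illustrative.

clusters = [
    {"cluster_id": "a1", "cluster_name": "shared-dev", "can_attach": True},
    {"cluster_id": "b2", "cluster_name": "prod-etl", "can_attach": False},
]

def attachable(cluster_list):
    """Return the names of clusters the user may attach to."""
    return [c["cluster_name"] for c in cluster_list if c["can_attach"]]

print(attachable(clusters))   # ['shared-dev']
```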
8. Terminating a Cluster & Impact
Termination:
- Manual: via the UI or API (`POST /api/2.0/clusters/delete`).
- Auto-Termination: after a configurable idle period.
Impact:
- Running Jobs Fail.
- Notebook Sessions Disconnected.
- Storage Persists (DBFS/Delta unaffected).
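As a sketch, the terminate call can be expressed as a request builder (the host and cluster ID are placeholders, and actually sending the request would need an auth token, omitted here):

```python
# Sketch: build the request that terminates (not permanently deletes) a
# cluster via the REST API. Terminating stops compute only; DBFS and
# Delta tables persist.

def terminate_request(host, cluster_id):
    """Return (method, url, json_body) for the Clusters API terminate call."""
    return ("POST", f"{host}/api/2.0/clusters/delete", {"cluster_id": cluster_id})

method, url, body = terminate_request("https://<workspace-host>", "<cluster-id>")
print(method, url, body)
```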
9. When to Restart a Cluster
- Scenarios:
- Library install requires restart.
- Configuration changes (e.g., autoscaling).
- Cluster becomes unresponsive.
10. Using Multiple Languages in a Notebook
- Magic Commands: `%python`, `%sql`, `%scala`, `%r` (placed on the first line of a cell).
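For example, in a Python notebook each snippet below would be its own cell, with the magic command on the cell's first line:

```
%python
print("hello from Python")

%sql
SELECT current_date()

%scala
println("hello from Scala")

%r
print("hello from R")
```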
11. Running a Notebook from Another Notebook
- `%run ./notebook` command: inlines another notebook's code into the current session (shares variables and definitions).
- `dbutils.notebook.run()` (for jobs): runs the notebook as an ephemeral job and returns its exit value.
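For example, in notebook cells (paths are hypothetical; `%run` must be alone in its cell, and `dbutils` exists only inside Databricks):

```
%run ./setup_notebook

result = dbutils.notebook.run(
    "./etl_notebook",   # path to the child notebook
    600,                # timeout in seconds
    {"env": "dev"},     # arguments passed to the child's widgets
)
```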
12. Sharing Notebooks
- Methods:
- Export: `.dbc` or `.ipynb` format.
- Permissions: `Can View`, `Can Edit`, `Can Run`.
- Workspace Access Control: admin-managed.
13. Databricks Repos for CI/CD
- Features:
- Git integration (GitHub, GitLab, Bitbucket).
- Sync notebooks with remote repos.
- Branching/Merging directly in UI.
- Clone repo → `git checkout feature-branch`.
- Develop → Commit → Push.
- Open PR → Merge to `main`.
14. Git Operations in Databricks Repos
- Supported:
`git clone`, `git pull`, `git push`, `git commit`, `git branch`.
- Not Supported:
`git rebase`, Git submodules.
15. Notebook Version Control Limitations (vs. Repos)
- Notebook History:
- Only saves local revisions (no Git integration).
- Repos Advantage:
- Full Git history.
- Collaboration (PRs, branching).