Available for data engineering & lakehouse architecture contracts

Thapelo Tshiamo Moloto

Data Engineer  ·  Lakehouse Architect

Azure Certified · Databricks Associate · Microsoft Fabric · Legacy Modernisation

I convert fragile legacy stacks into resilient, partition-optimised lakehouses. Specialising in SQL architecture, medallion design, and zero-loss data migrations across Retail, Insurance & Financial Services in South Africa.

// About

Data Engineering
Done Properly

Johannesburg-based Data Engineer with deep expertise in redesigning brittle legacy reporting stacks into scalable, cloud-native data platforms. I don't just move data—I architect systems that survive business growth.

My core philosophy: SQL is the fundamental unit of data truth. Every pipeline I build is grounded in proper indexing strategies, partitioning schemes, and semantic models that analysts actually understand.

Python is my orchestration layer: CLI tooling, NLP preprocessing, CV image pipelines—but the heavy lifting lives in the engine.

3+

Years Engineering
Data Systems

10M+

Rows Migrated
Zero Data Loss

4

Production
Data Products

5

Industry
Certifications

// Credentials

Verified
Certifications

DP-203

Azure Data Engineer Associate

Microsoft certified expertise in Azure data storage, processing pipelines, and security for enterprise analytics workloads.

Microsoft · Azure

Databricks

Data Engineer Associate

Certified expertise in Delta Lake, Spark medallion pipelines, and lakehouse architecture on the Databricks Unified Analytics Platform.

Databricks · Delta Lake · Spark

Microsoft Fabric

Analytics Engineer

Certified in OneLake, Lakehouse, Data Warehouse, Real-Time Analytics, and Power BI integration across the unified Fabric platform.

Microsoft · Fabric · OneLake

// Technical Stack

Architecture-First
Engineering Skills

SQL

SQL Architecture

Primary Discipline

Indexing & Partitioning · Expert
Window Functions · Expert
SCD Type 2 / MERGE · Expert
Materialized Views · Expert
Full-Text Search · Advanced
CTEs & Recursive SQL · Expert

Cloud & Platforms

Azure · Databricks · Fabric

Azure Data Lake Gen2 · Azure Data Factory · Azure Synapse · Databricks · Delta Lake · Microsoft Fabric · Apache Spark · Z-Order Indexing · Medallion Architecture · Power BI · OneLake · dbt
Py

Python Orchestration

Automation & Intelligence Layer

CLI Tooling (Click/Typer) · Advanced
PySpark / pandas · Expert
Schema Validation · Expert
OpenCV / CV Preprocessing · Intermediate
NLP Pipelines · Advanced

// Featured Projects

Engineering Work
Built to Scale

Each project includes a Technical Deep Dive with real SQL architecture. No tutorials, no SELECT * examples.

PROJECT 01 Retail / BI

Legacy to Modern BI

Semantic Layer Rebuild

SCD Type 2 · Star Schema · Materialized Views · Power BI

Migrated a mid-size retail business from flat-file CSV reporting to a fully normalised Star Schema on Azure Synapse. Rebuilt the semantic layer from scratch, achieving a 72% reduction in Power BI refresh time through incremental partition-based loads and materialised pre-aggregates.

[ Data Flow ]

CSV Flat Files → Azure Blob Ingest → Staging Schema
→ SCD Type 2 Dimension Tables → Star Schema Fact Tables
→ Materialized Views → Power BI Incremental Refresh

1. SCD Type 2 MERGE with Hash-Based Change Detection

Track historical changes to dimension records without losing prior state. A filtered index on is_current = 1 keeps live-record lookups to a narrow index seek instead of a scan.

-- SCD Type 2: Customer Dimension with effective dating
CREATE TABLE dim_customer (
    customer_sk      BIGINT IDENTITY(1,1) PRIMARY KEY,
    customer_bk      VARCHAR(50) NOT NULL,
    customer_name    VARCHAR(200),
    customer_segment VARCHAR(50),
    region           VARCHAR(100),
    effective_from   DATE NOT NULL,
    effective_to     DATE DEFAULT '9999-12-31',
    is_current       BIT DEFAULT 1,
    row_hash         BINARY(32)  -- SHA2 hash for change detection
);

-- Clustered index: business key + effective date (range scans)
CREATE CLUSTERED INDEX cidx_cust_bk_eff
    ON dim_customer (customer_bk, effective_from);

-- Filtered non-clustered: only live rows (current-record lookups)
CREATE NONCLUSTERED INDEX nidx_cust_current
    ON dim_customer (customer_bk)
    INCLUDE (customer_sk, customer_name, region)
    WHERE is_current = 1;

-- Step 1: expire changed rows, insert brand-new business keys
MERGE INTO dim_customer AS tgt
USING (
    SELECT s.customer_bk, s.customer_name, s.customer_segment, s.region,
           HASHBYTES('SHA2_256',
               COALESCE(s.customer_name,'') +
               COALESCE(s.customer_segment,'') +
               COALESCE(s.region,'')) AS row_hash
    FROM stg_customer s
) AS src
    ON tgt.customer_bk = src.customer_bk
   AND tgt.is_current = 1
WHEN MATCHED AND tgt.row_hash <> src.row_hash THEN
    UPDATE SET tgt.effective_to = DATEADD(DAY, -1, CAST(GETDATE() AS DATE)),
               tgt.is_current   = 0
WHEN NOT MATCHED THEN
    INSERT (customer_bk, customer_name, customer_segment, region,
            effective_from, row_hash)
    VALUES (src.customer_bk, src.customer_name, src.customer_segment,
            src.region, CAST(GETDATE() AS DATE), src.row_hash);

-- Step 2: re-insert the current version of every key the MERGE just expired
INSERT INTO dim_customer (customer_bk, customer_name, customer_segment,
                          region, effective_from, row_hash)
SELECT s.customer_bk, s.customer_name, s.customer_segment, s.region,
       CAST(GETDATE() AS DATE),
       HASHBYTES('SHA2_256',
           COALESCE(s.customer_name,'') +
           COALESCE(s.customer_segment,'') +
           COALESCE(s.region,''))
FROM stg_customer s
WHERE NOT EXISTS (SELECT 1 FROM dim_customer d
                  WHERE d.customer_bk = s.customer_bk
                    AND d.is_current = 1);

2. Partitioned Fact Table + Materialized View for Power BI

Monthly range partition + aligned columnstore index. A distributed materialized view pre-aggregates revenue so Power BI refreshes only the current partition slice.

-- Monthly partition scheme for incremental Power BI refresh
CREATE PARTITION FUNCTION pf_monthly (DATE)
AS RANGE RIGHT FOR VALUES (
    '2023-01-01','2023-02-01','2023-03-01','2023-04-01',
    '2023-05-01','2023-06-01','2023-07-01','2023-08-01',
    '2023-09-01','2023-10-01','2023-11-01','2023-12-01'
);

CREATE PARTITION SCHEME ps_monthly
    AS PARTITION pf_monthly ALL TO ([PRIMARY]);

-- Fact table partitioned by sale_date
CREATE TABLE fact_sales (
    sale_sk     BIGINT IDENTITY(1,1),
    sale_date   DATE NOT NULL,
    customer_sk BIGINT NOT NULL,
    product_sk  BIGINT NOT NULL,
    store_sk    BIGINT NOT NULL,
    revenue     DECIMAL(18,2),
    units_sold  INT
) ON ps_monthly(sale_date);

CREATE CLUSTERED COLUMNSTORE INDEX cci_fact_sales
    ON fact_sales ON ps_monthly(sale_date);

-- Materialized view: monthly revenue pre-aggregated for DirectQuery
-- (Synapse MVs require COUNT_BIG(*) and don't allow DISTINCT aggregates)
CREATE MATERIALIZED VIEW mv_monthly_revenue
WITH (DISTRIBUTION = HASH(store_sk))
AS
SELECT DATEFROMPARTS(YEAR(f.sale_date), MONTH(f.sale_date), 1) AS month_start,
       d.region,
       p.category,
       SUM(f.revenue)    AS total_revenue,
       SUM(f.units_sold) AS total_units,
       COUNT_BIG(*)      AS sale_count
FROM fact_sales f
JOIN dim_customer d ON f.customer_sk = d.customer_sk AND d.is_current = 1
JOIN dim_product  p ON f.product_sk  = p.product_sk
GROUP BY DATEFROMPARTS(YEAR(f.sale_date), MONTH(f.sale_date), 1),
         d.region, p.category;
PROJECT 02 Insurance / Lakehouse

Insurance-Ghost

Scalable Medallion Lakehouse

Medallion Architecture · Delta Lake · Window Functions · Z-Order Indexing

Designed a full Bronze / Silver / Gold Medallion Architecture on Databricks + Azure Data Lake Gen2 for a high-volume synthetic insurance dataset. The Gold layer exposes risk-scoring analytics via RANK and LAG window functions, enabling claims pattern detection 12× faster than the prior flat-table approach.

[ Medallion Architecture ]

Raw JSON/CSV → Bronze (raw, full history)
Silver (cleaned: Policy + Claims + Customer joins)
Gold (risk aggregates, RANK windows, serving layer)
Partitioned by Policy_Year  ·  Z-Order on Claim_ID + Policy_Number

1. Delta Lake: Policy_Year Partitioning + Z-Order on Claim_ID

Horizontal partition by year eliminates full-table scans. Z-Order on (Claim_ID, Policy_Number) co-locates correlated rows so multi-table joins hit minimal files.

# PySpark: write Gold layer partitioned by Policy_Year
from pyspark.sql import functions as F

gold_df = (
    silver_claims_df
    .withColumn("Policy_Year", F.year("policy_start_date"))
    .withColumn("claim_amount", F.coalesce(F.col("claim_amount"), F.lit(0)))
)

(gold_df.write
    .format("delta")
    .mode("overwrite")
    .option("overwriteSchema", "true")
    .partitionBy("Policy_Year")
    .save("abfss://gold@datalake.dfs.core.windows.net/insurance/claims")
)

-- Databricks SQL (separate cell)
-- Z-Order: cluster Claim_ID + Policy_Number within each partition
OPTIMIZE gold.insurance_claims ZORDER BY (Claim_ID, Policy_Number);

-- Retain 7-day time-travel, purge stale files
VACUUM gold.insurance_claims RETAIN 168 HOURS;

2. Risk Pattern Detection — RANK, LAG & Running Totals

High-frequency claimant detection using LAG (days-between-claims), cumulative exposure totals, and per-year claim ranking to feed the actuarial risk model.

WITH claim_timeline AS (
    SELECT c.customer_id,
           cl.claim_date,
           cl.claim_amount,
           cl.claim_type,
           DATEDIFF('day',
                    LAG(cl.claim_date) OVER (PARTITION BY c.customer_id
                                             ORDER BY cl.claim_date),
                    cl.claim_date) AS days_since_prev_claim,
           SUM(cl.claim_amount) OVER (
               PARTITION BY c.customer_id ORDER BY cl.claim_date
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           ) AS cumulative_exposure,
           DENSE_RANK() OVER (
               PARTITION BY c.customer_id, YEAR(cl.claim_date)
               ORDER BY cl.claim_amount DESC
           ) AS claim_rank_ytd
    FROM fact_claims cl
    JOIN dim_customer c  ON cl.customer_id   = c.customer_id
    JOIN dim_policy   p  ON cl.policy_number = p.policy_number
    JOIN dim_product  pr ON p.product_code   = pr.product_code
    WHERE p.policy_year = YEAR(CURRENT_DATE()) - 1
),
risk_flags AS (
    SELECT *,
           CASE
               WHEN days_since_prev_claim < 30   THEN 'HIGH_FREQUENCY'
               WHEN cumulative_exposure > 500000 THEN 'HIGH_EXPOSURE'
               WHEN claim_rank_ytd = 1
                AND claim_amount > 100000         THEN 'LARGE_SINGLE_CLAIM'
               ELSE 'NORMAL'
           END AS risk_category
    FROM claim_timeline
)
SELECT *
FROM risk_flags
WHERE risk_category <> 'NORMAL'
ORDER BY cumulative_exposure DESC;
PROJECT 03 Migration / Python + SQL

Zenith-Migrator

Legacy SQL Server → Azure Data Lake

Python CLI · Bulk Insert · Checksum Audit · Clustered Index

A production-grade Python CLI that orchestrates the full ETL lifecycle from legacy SQL Server 2012 to Azure Data Lake Gen2. Python handles pre-flight schema validation and checksum generation; SQL handles bulk loads and post-load clustered index rebuilds. Migrated 1.4 million rows with automated zero-data-loss verification.

1. Python Pre-Flight Validation CLI

Before a row moves, the migrator validates schemas match, computes source checksums, and aborts on mismatch — preventing silent data corruption.

# zenith_migrator/cli.py
import click
import pyodbc
from azure.storage.filedatalake import DataLakeServiceClient

@click.command()
@click.option("--table", required=True)
@click.option("--batch", default=50000)
@click.option("--dry-run", is_flag=True)
def migrate(table, batch, dry_run):
    """Migrate a SQL Server table to ADLS with checksum audit."""
    schema_diff = compare_schemas(get_source_schema(table),
                                  get_target_schema(table))
    if schema_diff:
        click.echo(f"Schema mismatch: {schema_diff}", err=True)
        raise click.Abort()

    src = compute_checksum(table, source=True)
    click.echo(f"Source: {src['row_count']:,} rows | hash={src['col_hash']}")

    if not dry_run:
        bulk_migrate(table, batch_size=batch)
        tgt = compute_checksum(table, source=False)
        if src['row_count'] != tgt['row_count']:
            raise RuntimeError("DATA LOSS DETECTED")
        click.echo(f"Verified: {tgt['row_count']:,} rows | 0% data loss")

def compute_checksum(table, source=True):
    query = f"""
        SELECT COUNT(*) AS row_count,
               CHECKSUM_AGG(CHECKSUM(*)) AS col_hash
        FROM [{table}] WITH (NOLOCK)
    """
    with (get_src_conn() if source else get_tgt_conn()) as conn:
        r = conn.execute(query).fetchone()
    return {"row_count": r[0], "col_hash": r[1]}

2. Bulk Load + Post-Load Clustered Index

Indexes are intentionally dropped before bulk insert (minimally logged) and rebuilt after — the correct sequence for maximum throughput on large row sets.

-- Drop index before bulk load (enables minimal logging)
IF EXISTS (SELECT 1 FROM sys.indexes
           WHERE object_id = OBJECT_ID('dbo.migration_staging')
             AND name = 'cidx_staging_pk')
    DROP INDEX cidx_staging_pk ON dbo.migration_staging;

-- Bulk load with TABLOCK for minimal logging
BULK INSERT dbo.migration_staging
FROM '\\staging\export\orders_001.csv'
WITH (
    FIELDTERMINATOR = ',',
    ROWTERMINATOR   = '\n',
    FIRSTROW        = 2,
    BATCHSIZE       = 50000,
    TABLOCK,
    CHECK_CONSTRAINTS,
    FIRE_TRIGGERS  -- note: firing triggers forces full logging; omit for pure speed
);

-- Rebuild clustered index after all batches complete
CREATE CLUSTERED INDEX cidx_staging_pk
    ON dbo.migration_staging (order_id)
    WITH (SORT_IN_TEMPDB = ON, FILLFACTOR = 90, ONLINE = OFF);

-- Non-clustered covering index for query patterns
CREATE NONCLUSTERED INDEX nidx_staging_cust_date
    ON dbo.migration_staging (customer_id, order_date)
    INCLUDE (order_total, status)
    WITH (FILLFACTOR = 85, ONLINE = OFF);

-- Row-count integrity guard
DECLARE @src BIGINT = 1412847;
DECLARE @tgt BIGINT = (SELECT COUNT_BIG(*)
                       FROM dbo.migration_staging WITH (NOLOCK));
IF @src <> @tgt
    THROW 50001, 'DATA LOSS: row count mismatch post-migration.', 1;
PRINT 'Migration audit PASSED: zero data loss confirmed.';
PROJECT 04 FinTech / Time-Series

Chaos Arbitrageur

Trading Signal Engine

Time-Series SQL · Sliding Windows · Composite Index · VWAP / Volatility

Real-time financial event processing pipeline. Ingests tick-level market data and computes moving averages, volatility metrics, and cross-asset signals using sliding window SQL. A composite index on (Ticker, Timestamp) enables sub-second signal retrieval across millions of daily events.

[ Architecture Diagram Placeholder — Chaos Arbitrageur ]

Market Feed API → Kafka Ingest → Bronze (raw ticks)
→ Silver (OHLCV normalised) → Gold (signals, moving avgs, VWAP)
→ Composite Index (Ticker, Timestamp) → Sub-second Query Layer
→ Signal Dashboard / Webhook Alerts

1. Composite Index Design — Sub-Second Signal Retrieval

All query patterns are ticker-first with a time range. The composite clustered index on (ticker, trade_ts) guarantees index seeks, not scans, at any scale.

-- Partitioned tick table, composite cluster key
CREATE TABLE market_ticks (
    tick_id     BIGINT IDENTITY(1,1),
    ticker      VARCHAR(10) NOT NULL,
    trade_ts    DATETIME2(6) NOT NULL,
    open_price  DECIMAL(18,6),
    high_price  DECIMAL(18,6),
    low_price   DECIMAL(18,6),
    close_price DECIMAL(18,6),
    volume      BIGINT,
    trade_date  DATE NOT NULL
) ON ps_trade_daily(trade_date);

-- Clustered composite: ticker first, then timestamp
CREATE CLUSTERED INDEX cidx_ticks_ticker_ts
    ON market_ticks (ticker ASC, trade_ts ASC);

-- Covering non-clustered index: OHLCV on the same key avoids key lookups
CREATE NONCLUSTERED INDEX nidx_ticks_ohlcv
    ON market_ticks (ticker, trade_ts)
    INCLUDE (open_price, high_price, low_price, close_price, volume)
    WITH (FILLFACTOR = 80);

2. Sliding Window — SMA, Volatility & VWAP

SMA-5, SMA-30, rolling standard deviation (volatility), VWAP, and Golden/Death Cross detection all in a single CTE chain — no application-layer computation needed.

WITH price_base AS (
    SELECT ticker, trade_ts, trade_date, close_price, volume,
           AVG(close_price) OVER (
               PARTITION BY ticker ORDER BY trade_ts
               ROWS BETWEEN 4 PRECEDING AND CURRENT ROW
           ) AS sma_5,
           AVG(close_price) OVER (
               PARTITION BY ticker ORDER BY trade_ts
               ROWS BETWEEN 29 PRECEDING AND CURRENT ROW
           ) AS sma_30,
           STDEVP(close_price) OVER (
               PARTITION BY ticker ORDER BY trade_ts
               ROWS BETWEEN 29 PRECEDING AND CURRENT ROW
           ) AS volatility_30,
           SUM(close_price * volume) OVER (
               PARTITION BY ticker, trade_date ORDER BY trade_ts
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           )
           / NULLIF(SUM(volume) OVER (
               PARTITION BY ticker, trade_date ORDER BY trade_ts
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW), 0
           ) AS vwap,
           (close_price - LAG(close_price, 1) OVER (PARTITION BY ticker
                                                    ORDER BY trade_ts))
           / NULLIF(LAG(close_price, 1) OVER (PARTITION BY ticker
                                              ORDER BY trade_ts), 0) AS pct_return
    FROM market_ticks
    WHERE trade_date >= DATEADD(day, -90, CAST(GETDATE() AS DATE))
),
signals AS (
    SELECT *,
           CASE
               WHEN sma_5 > sma_30
                AND LAG(sma_5)  OVER (PARTITION BY ticker ORDER BY trade_ts)
                 <= LAG(sma_30) OVER (PARTITION BY ticker ORDER BY trade_ts)
                   THEN 'GOLDEN_CROSS'
               WHEN sma_5 < sma_30
                AND LAG(sma_5)  OVER (PARTITION BY ticker ORDER BY trade_ts)
                 >= LAG(sma_30) OVER (PARTITION BY ticker ORDER BY trade_ts)
                   THEN 'DEATH_CROSS'
           END AS signal
    FROM price_base
)
SELECT ticker, trade_ts, close_price,
       ROUND(sma_5, 4)            AS sma_5,
       ROUND(sma_30, 4)           AS sma_30,
       ROUND(volatility_30, 6)    AS volatility,
       ROUND(vwap, 4)             AS vwap,
       ROUND(pct_return * 100, 4) AS pct_return,
       signal
FROM signals
WHERE signal IS NOT NULL
ORDER BY trade_ts DESC;

// Consulting Experience

Enterprise Engagements
Real Business Impact

Engagements are described at an industry level to respect client NDAs; sectors and outcomes are accurate.

Decision Inc.

Analytics Consulting Firm · Johannesburg

Retail / FMCG

Semantic Layer Redesign

Rebuilt a legacy SSAS cube into a modern star schema on Azure Synapse for a large retail client. Introduced SCD Type 2 across all slowly-changing dimensions, eliminating aggregation inconsistencies in monthly management reporting packs.

Power BI Performance Engineering

Applied partition-aligned materialized views and incremental refresh policies, reducing 30-minute daily refresh cycles to under 4 minutes — a 7.5× improvement by restricting refreshes to the most recently modified partition.
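The pruning logic behind that refresh policy boils down to comparing each partition's data watermark against its last refresh. A minimal sketch of the idea in Python (illustrative names, not the client's actual code):

```python
from datetime import date

def partitions_to_refresh(partition_watermarks, last_refreshed):
    """Return only partitions whose data changed since their last refresh.

    partition_watermarks: {partition_key: max modified date in that partition}
    last_refreshed:       {partition_key: date the partition was last refreshed}
    Partitions never seen before are always refreshed.
    """
    stale = []
    for key, modified in partition_watermarks.items():
        refreshed = last_refreshed.get(key)
        if refreshed is None or modified > refreshed:
            stale.append(key)
    return sorted(stale)

# Only 2023-12 changed since its last refresh, so only that slice reloads;
# the other monthly partitions are skipped entirely.
watermarks = {"2023-11": date(2023, 11, 30), "2023-12": date(2023, 12, 18)}
refreshed  = {"2023-11": date(2023, 12, 1),  "2023-12": date(2023, 12, 15)}
print(partitions_to_refresh(watermarks, refreshed))  # ['2023-12']
```

Refreshing one partition instead of twelve is where the order-of-magnitude cycle-time win comes from.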

ETL Governance Framework

Built row-level audit tables, CHECKSUM_AGG validation, and data lineage tagging across 14 ADF pipelines. Zero pipeline failures in the 3 months post-implementation.
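The audit idea can be sketched in Python: row counts plus an order-insensitive hash, compared across pipeline stages. This is a hand-rolled illustration in the spirit of T-SQL's CHECKSUM_AGG(CHECKSUM(*)), with illustrative names rather than the framework's actual code:

```python
import hashlib

def table_fingerprint(rows):
    """Order-insensitive fingerprint: row count + XOR-fold of per-row SHA-256."""
    agg = 0
    for row in rows:
        digest = hashlib.sha256("|".join(map(str, row)).encode()).digest()
        agg ^= int.from_bytes(digest[:8], "big")  # fold each row to 64 bits
    return {"row_count": len(rows), "row_hash": agg}

def audit(source_rows, target_rows):
    """Raise if source and target fingerprints diverge anywhere."""
    src, tgt = table_fingerprint(source_rows), table_fingerprint(target_rows)
    if src != tgt:
        raise RuntimeError(f"Audit FAILED: {src} != {tgt}")
    return "PASSED"

rows = [(1, "A", 100.0), (2, "B", 250.5)]
print(audit(rows, list(reversed(rows))))  # PASSED: row order doesn't matter
```

Because XOR is commutative, the fingerprint is stable under re-ordering, which matters when source and target return rows in different physical order.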

Azure Synapse · Power BI · ADF · SQL Server · SCD Type 2

PBT Group

Data & Analytics Specialists · SA

Insurance / FSI

Lakehouse Migration (Insurance)

Led the design of a Medallion Architecture migration for an insurance client, moving from on-prem Oracle to Azure Data Lake Gen2 + Databricks. Defined Bronze/Silver/Gold zone schemas and all ingestion patterns.

Claims Analytics Engineering

Implemented window-function risk scoring (RANK, LEAD/LAG) in Databricks SQL to support an actuarial team. Reduced time-to-insight from weekly batch reports to near-real-time dashboards.

Data Quality Framework

Built a Python profiling & validation suite (Great Expectations), enforcing schema contracts at the Silver layer boundary. Caught 23 upstream schema breaks before they propagated to Gold.
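The contract-at-the-Silver-boundary idea, sketched without the Great Expectations dependency (column names and rules here are illustrative, not the client's schema):

```python
# Illustrative Silver-layer schema contract: names, types, nullability.
SILVER_CLAIMS_CONTRACT = {
    "claim_id":      {"type": str,   "nullable": False},
    "policy_number": {"type": str,   "nullable": False},
    "claim_amount":  {"type": float, "nullable": True},
}

def validate_batch(records, contract):
    """Reject a batch that breaks the contract before it reaches Gold."""
    errors = []
    for i, rec in enumerate(records):
        missing = set(contract) - set(rec)
        if missing:
            errors.append(f"row {i}: missing columns {sorted(missing)}")
            continue
        for col, rule in contract.items():
            val = rec[col]
            if val is None:
                if not rule["nullable"]:
                    errors.append(f"row {i}: {col} is NULL")
            elif not isinstance(val, rule["type"]):
                errors.append(f"row {i}: {col} has type {type(val).__name__}")
    return errors

good = [{"claim_id": "C1", "policy_number": "P9", "claim_amount": 120.0}]
bad  = [{"claim_id": None, "policy_number": "P9", "claim_amount": "oops"}]
print(validate_batch(good, SILVER_CLAIMS_CONTRACT))  # []
print(validate_batch(bad, SILVER_CLAIMS_CONTRACT))
# ['row 0: claim_id is NULL', 'row 0: claim_amount has type str']
```

The point is where the check runs: at the Silver boundary, so an upstream break fails loudly instead of propagating to Gold.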

Databricks · Delta Lake · Python · Oracle → Azure · Great Expectations

// High-Demand Niche · South Africa

Legacy Modernisation Specialist

South African banking, retail, and insurance enterprises are sitting on decades of technical debt: Oracle environments, SSAS cubes, flat-file ETL, and fragile SSIS packages. There is a growing — and largely unmet — demand for engineers who can speak both the legacy language and the cloud future.

I bridge that gap: legacy stack fluency for safe migrations, cloud architecture knowledge for building something worth migrating to. That combination — legacy fluency + modern lakehouse architecture — is the niche.

// Let's Build

Ready to Architect
Your Data Platform?

Whether you need a lakehouse built from scratch, a legacy migration executed cleanly, or a senior engineer to close a technical gap — let's talk.

Based in Johannesburg, SA  ·  Available for remote & on-site contracts