Available for data engineering & lakehouse architecture contracts

Thapelo Tshiamo Moloto

Data Engineer  ·  Lakehouse Architect

Azure Certified · Databricks Associate · Microsoft Fabric · Legacy Modernisation

I convert fragile legacy stacks into resilient, partition-optimised lakehouses. Specialising in SQL architecture, medallion design, and zero-loss data migrations across Retail, Insurance & Financial Services in South Africa.

// About

Data Engineering
Done Properly

Johannesburg-based Data Engineer with deep expertise in redesigning brittle legacy reporting stacks into scalable, cloud-native data platforms. I don't just move data—I architect systems that survive business growth.

My core philosophy: SQL is the fundamental unit of data truth. Every pipeline I build is grounded in proper indexing strategies, partitioning schemes, and semantic models that analysts actually understand.

Python is my orchestration layer: CLI tooling, NLP preprocessing, CV image pipelines—but the heavy lifting lives in the engine.

3+

Years Engineering
Data Systems

10M+

Rows Migrated
Zero Data Loss

4

Production
Data Products

5

Industry
Certifications

// Credentials

Verified
Certifications

DP-203

Azure Data Engineer Associate

Microsoft certified expertise in Azure data storage, processing pipelines, and security for enterprise analytics workloads.

Microsoft · Azure

Databricks

Data Engineer Associate

Certified expertise in Delta Lake, Spark medallion pipelines, and lakehouse architecture on the Databricks Unified Analytics Platform.

Databricks · Delta Lake · Spark

Microsoft Fabric

Analytics Engineer

Certified in OneLake, Lakehouse, Data Warehouse, Real-Time Analytics, and Power BI integration across the unified Fabric platform.

Microsoft · Fabric · OneLake

// Technical Stack

Architecture-First
Engineering Skills

SQL

SQL Architecture

Primary Discipline

Indexing & Partitioning · Expert
Window Functions · Expert
SCD Type 2 / MERGE · Expert
Materialized Views · Expert
Full-Text Search · Advanced
CTEs & Recursive SQL · Expert

Cloud & Platforms

Azure · Databricks · Fabric

Azure Data Lake Gen2 · Azure Data Factory · Azure Synapse · Databricks · Delta Lake · Microsoft Fabric · Apache Spark · Z-Order Indexing · Medallion Architecture · Power BI · OneLake · dbt
Py

Python Orchestration

Automation & Intelligence Layer

CLI Tooling (Click/Typer) · Advanced
PySpark / pandas · Expert
Schema Validation · Expert
OpenCV / CV Preprocessing · Intermediate
NLP Pipelines · Advanced

// Featured Projects

Engineering Work
Built to Scale

Each project includes a Technical Deep Dive with real SQL architecture. No tutorials, no SELECT * examples.

PROJECT 01 Retail / BI

Legacy to Modern BI

Semantic Layer Rebuild

SCD Type 2 · Star Schema · Materialized Views · Power BI

Migrated a mid-size retail business from flat-file CSV reporting to a fully normalised Star Schema on Azure Synapse. Rebuilt the semantic layer from scratch, achieving a 72% reduction in Power BI refresh time through incremental partition-based loads and materialised pre-aggregates.

[ Data Flow ]

CSV Flat Files → Azure Blob Ingest → Staging Schema
→ SCD Type 2 Dimension Tables → Star Schema Fact Tables
→ Materialized Views → Power BI Incremental Refresh

1. SCD Type 2 MERGE with Hash-Based Change Detection

Track historical changes to dimension records without losing prior state. A filtered index on is_current = 1 keeps live-record lookups to a narrow index seek instead of a scan.

-- SCD Type 2: Customer Dimension with effective dating
CREATE TABLE dim_customer (
    customer_sk      BIGINT IDENTITY(1,1) PRIMARY KEY,
    customer_bk      VARCHAR(50) NOT NULL,
    customer_name    VARCHAR(200),
    customer_segment VARCHAR(50),
    region           VARCHAR(100),
    effective_from   DATE NOT NULL,
    effective_to     DATE DEFAULT '9999-12-31',
    is_current       BIT DEFAULT 1,
    row_hash         BINARY(32)  -- SHA2 hash for change detection
);

-- Clustered index: business key + effective date (range scans)
CREATE CLUSTERED INDEX cidx_cust_bk_eff
    ON dim_customer (customer_bk, effective_from);

-- Filtered non-clustered: only live rows (current-record lookups)
CREATE NONCLUSTERED INDEX nidx_cust_current
    ON dim_customer (customer_bk)
    INCLUDE (customer_sk, customer_name, region)
    WHERE is_current = 1;

-- Step 1: expire changed rows, insert brand-new business keys
MERGE INTO dim_customer AS tgt
USING (
    SELECT s.customer_bk, s.customer_name, s.customer_segment, s.region,
           HASHBYTES('SHA2_256',
               COALESCE(s.customer_name,'') +
               COALESCE(s.customer_segment,'') +
               COALESCE(s.region,'')) AS row_hash
    FROM stg_customer s
) AS src
    ON tgt.customer_bk = src.customer_bk
   AND tgt.is_current = 1
WHEN MATCHED AND tgt.row_hash <> src.row_hash THEN
    UPDATE SET tgt.effective_to = DATEADD(DAY, -1, CAST(GETDATE() AS DATE)),
               tgt.is_current   = 0
WHEN NOT MATCHED THEN
    INSERT (customer_bk, customer_name, customer_segment, region,
            effective_from, row_hash)
    VALUES (src.customer_bk, src.customer_name, src.customer_segment,
            src.region, CAST(GETDATE() AS DATE), src.row_hash);

-- Step 2: re-insert the current version of every key the MERGE just expired
INSERT INTO dim_customer (customer_bk, customer_name, customer_segment,
                          region, effective_from, row_hash)
SELECT s.customer_bk, s.customer_name, s.customer_segment, s.region,
       CAST(GETDATE() AS DATE),
       HASHBYTES('SHA2_256',
           COALESCE(s.customer_name,'') +
           COALESCE(s.customer_segment,'') +
           COALESCE(s.region,''))
FROM stg_customer s
WHERE NOT EXISTS (SELECT 1 FROM dim_customer d
                  WHERE d.customer_bk = s.customer_bk
                    AND d.is_current = 1);

2. Partitioned Fact Table + Materialized View for Power BI

Monthly range partition + aligned columnstore index. A distributed materialized view pre-aggregates revenue so Power BI refreshes only the current partition slice.

-- Monthly partition scheme for incremental Power BI refresh
CREATE PARTITION FUNCTION pf_monthly (DATE)
AS RANGE RIGHT FOR VALUES (
    '2023-01-01','2023-02-01','2023-03-01','2023-04-01',
    '2023-05-01','2023-06-01','2023-07-01','2023-08-01',
    '2023-09-01','2023-10-01','2023-11-01','2023-12-01'
);

CREATE PARTITION SCHEME ps_monthly
    AS PARTITION pf_monthly ALL TO ([PRIMARY]);

-- Fact table partitioned by sale_date
CREATE TABLE fact_sales (
    sale_sk     BIGINT IDENTITY(1,1),
    sale_date   DATE NOT NULL,
    customer_sk BIGINT NOT NULL,
    product_sk  BIGINT NOT NULL,
    store_sk    BIGINT NOT NULL,
    revenue     DECIMAL(18,2),
    units_sold  INT
) ON ps_monthly(sale_date);

CREATE CLUSTERED COLUMNSTORE INDEX cci_fact_sales
    ON fact_sales ON ps_monthly(sale_date);

-- Materialized view: monthly revenue pre-aggregated for DirectQuery
-- (Synapse MVs require COUNT_BIG(*) and don't allow DISTINCT aggregates)
CREATE MATERIALIZED VIEW mv_monthly_revenue
WITH (DISTRIBUTION = HASH(store_sk))
AS
SELECT DATEFROMPARTS(YEAR(f.sale_date), MONTH(f.sale_date), 1) AS month_start,
       d.region,
       p.category,
       SUM(f.revenue)    AS total_revenue,
       SUM(f.units_sold) AS total_units,
       COUNT_BIG(*)      AS sale_count
FROM fact_sales f
JOIN dim_customer d ON f.customer_sk = d.customer_sk AND d.is_current = 1
JOIN dim_product  p ON f.product_sk  = p.product_sk
GROUP BY DATEFROMPARTS(YEAR(f.sale_date), MONTH(f.sale_date), 1),
         d.region, p.category;
PROJECT 02 Insurance / Lakehouse

Insurance-Ghost

Scalable Medallion Lakehouse

Medallion Architecture · Delta Lake · Window Functions · Z-Order Indexing

Designed a full Bronze / Silver / Gold Medallion Architecture on Databricks + Azure Data Lake Gen2 for a high-volume synthetic insurance dataset. The Gold layer exposes risk-scoring analytics via RANK and LAG window functions, enabling claims pattern detection 12× faster than the prior flat-table approach.

[ Medallion Architecture ]

Raw JSON/CSV → Bronze (raw, full history)
Silver (cleaned: Policy + Claims + Customer joins)
Gold (risk aggregates, RANK windows, serving layer)
Partitioned by Policy_Year  ·  Z-Order on Claim_ID + Policy_Number

1. Delta Lake: Policy_Year Partitioning + Z-Order on Claim_ID

Horizontal partition by year eliminates full-table scans. Z-Order on (Claim_ID, Policy_Number) co-locates correlated rows so multi-table joins hit minimal files.

# PySpark: write Gold layer partitioned by Policy_Year
from pyspark.sql import functions as F

gold_df = (
    silver_claims_df
    .withColumn("Policy_Year", F.year("policy_start_date"))
    .withColumn("claim_amount", F.coalesce(F.col("claim_amount"), F.lit(0)))
)

(gold_df.write
    .format("delta")
    .mode("overwrite")
    .option("overwriteSchema", "true")
    .partitionBy("Policy_Year")
    .save("abfss://gold@datalake.dfs.core.windows.net/insurance/claims")
)

-- Databricks SQL (separate cell)
-- Z-Order: cluster Claim_ID + Policy_Number within each partition
OPTIMIZE gold.insurance_claims ZORDER BY (Claim_ID, Policy_Number);

-- Retain 7-day time-travel, purge stale files
VACUUM gold.insurance_claims RETAIN 168 HOURS;

2. Risk Pattern Detection — RANK, LAG & Running Totals

High-frequency claimant detection using LAG (days-between-claims), cumulative exposure totals, and per-year claim ranking to feed the actuarial risk model.

WITH claim_timeline AS (
    SELECT c.customer_id,
           cl.claim_date,
           cl.claim_amount,
           cl.claim_type,
           DATEDIFF('day',
                    LAG(cl.claim_date) OVER (PARTITION BY c.customer_id
                                             ORDER BY cl.claim_date),
                    cl.claim_date) AS days_since_prev_claim,
           SUM(cl.claim_amount) OVER (
               PARTITION BY c.customer_id ORDER BY cl.claim_date
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           ) AS cumulative_exposure,
           DENSE_RANK() OVER (
               PARTITION BY c.customer_id, YEAR(cl.claim_date)
               ORDER BY cl.claim_amount DESC
           ) AS claim_rank_ytd
    FROM fact_claims cl
    JOIN dim_customer c  ON cl.customer_id   = c.customer_id
    JOIN dim_policy   p  ON cl.policy_number = p.policy_number
    JOIN dim_product  pr ON p.product_code   = pr.product_code
    WHERE p.policy_year = YEAR(CURRENT_DATE()) - 1
),
risk_flags AS (
    SELECT *,
           CASE
               WHEN days_since_prev_claim < 30   THEN 'HIGH_FREQUENCY'
               WHEN cumulative_exposure > 500000 THEN 'HIGH_EXPOSURE'
               WHEN claim_rank_ytd = 1
                AND claim_amount > 100000         THEN 'LARGE_SINGLE_CLAIM'
               ELSE 'NORMAL'
           END AS risk_category
    FROM claim_timeline
)
SELECT *
FROM risk_flags
WHERE risk_category <> 'NORMAL'
ORDER BY cumulative_exposure DESC;
PROJECT 03 Migration / Python + SQL

Zenith-Migrator

Legacy SQL Server → Azure Data Lake

Python CLI · Bulk Insert · Checksum Audit · Clustered Index

A production-grade Python CLI that orchestrates the full ETL lifecycle from legacy SQL Server 2012 to Azure Data Lake Gen2. Python handles pre-flight schema validation and checksum generation; SQL handles bulk loads and post-load clustered index rebuilds. Migrated 1.4 million rows with automated zero-data-loss verification.

1. Python Pre-Flight Validation CLI

Before a row moves, the migrator validates schemas match, computes source checksums, and aborts on mismatch — preventing silent data corruption.

# zenith_migrator/cli.py
import click
import pyodbc
from azure.storage.filedatalake import DataLakeServiceClient

@click.command()
@click.option("--table", required=True)
@click.option("--batch", default=50000)
@click.option("--dry-run", is_flag=True)
def migrate(table, batch, dry_run):
    """Migrate a SQL Server table to ADLS with checksum audit."""
    schema_diff = compare_schemas(get_source_schema(table),
                                  get_target_schema(table))
    if schema_diff:
        click.echo(f"Schema mismatch: {schema_diff}", err=True)
        raise click.Abort()

    src = compute_checksum(table, source=True)
    click.echo(f"Source: {src['row_count']:,} rows | hash={src['col_hash']}")

    if not dry_run:
        bulk_migrate(table, batch_size=batch)
        tgt = compute_checksum(table, source=False)
        if src['row_count'] != tgt['row_count']:
            raise RuntimeError("DATA LOSS DETECTED")
        click.echo(f"Verified: {tgt['row_count']:,} rows | 0% data loss")

def compute_checksum(table, source=True):
    query = f"""
        SELECT COUNT(*) AS row_count,
               CHECKSUM_AGG(CHECKSUM(*)) AS col_hash
        FROM [{table}] WITH (NOLOCK)
    """
    with (get_src_conn() if source else get_tgt_conn()) as conn:
        r = conn.execute(query).fetchone()
    return {"row_count": r[0], "col_hash": r[1]}

2. Bulk Load + Post-Load Clustered Index

Indexes are intentionally dropped before bulk insert (minimally logged) and rebuilt after — the correct sequence for maximum throughput on large row sets.

-- Drop index before bulk load (enables minimal logging)
IF EXISTS (SELECT 1 FROM sys.indexes
           WHERE object_id = OBJECT_ID('dbo.migration_staging')
             AND name = 'cidx_staging_pk')
    DROP INDEX cidx_staging_pk ON dbo.migration_staging;

-- Bulk load with TABLOCK for minimal logging
BULK INSERT dbo.migration_staging
FROM '\\staging\export\orders_001.csv'
WITH (
    FIELDTERMINATOR = ',',
    ROWTERMINATOR   = '\n',
    FIRSTROW        = 2,
    BATCHSIZE       = 50000,
    TABLOCK,
    CHECK_CONSTRAINTS,
    FIRE_TRIGGERS  -- note: firing triggers forces full logging; omit for pure speed
);

-- Rebuild clustered index after all batches complete
CREATE CLUSTERED INDEX cidx_staging_pk
    ON dbo.migration_staging (order_id)
    WITH (SORT_IN_TEMPDB = ON, FILLFACTOR = 90, ONLINE = OFF);

-- Non-clustered covering index for query patterns
CREATE NONCLUSTERED INDEX nidx_staging_cust_date
    ON dbo.migration_staging (customer_id, order_date)
    INCLUDE (order_total, status)
    WITH (FILLFACTOR = 85, ONLINE = OFF);

-- Row-count integrity guard
DECLARE @src BIGINT = 1412847;
DECLARE @tgt BIGINT = (SELECT COUNT_BIG(*)
                       FROM dbo.migration_staging WITH (NOLOCK));
IF @src <> @tgt
    THROW 50001, 'DATA LOSS: row count mismatch post-migration.', 1;
PRINT 'Migration audit PASSED: zero data loss confirmed.';
PROJECT 04 FinTech / Time-Series

Chaos Arbitrageur

Trading Signal Engine

Time-Series SQL · Sliding Windows · Composite Index · VWAP / Volatility

Real-time financial event processing pipeline. Ingests tick-level market data and computes moving averages, volatility metrics, and cross-asset signals using sliding window SQL. A composite index on (Ticker, Timestamp) enables sub-second signal retrieval across millions of daily events.

[ Architecture Diagram Placeholder — Chaos Arbitrageur ]

Market Feed API → Kafka Ingest → Bronze (raw ticks)
→ Silver (OHLCV normalised) → Gold (signals, moving avgs, VWAP)
→ Composite Index (Ticker, Timestamp) → Sub-second Query Layer
→ Signal Dashboard / Webhook Alerts

1. Composite Index Design — Sub-Second Signal Retrieval

All query patterns are ticker-first with a time range. The composite clustered index on (ticker, trade_ts) guarantees index seeks, not scans, at any scale.

-- Partitioned tick table, composite cluster key
CREATE TABLE market_ticks (
    tick_id     BIGINT IDENTITY(1,1),
    ticker      VARCHAR(10) NOT NULL,
    trade_ts    DATETIME2(6) NOT NULL,
    open_price  DECIMAL(18,6),
    high_price  DECIMAL(18,6),
    low_price   DECIMAL(18,6),
    close_price DECIMAL(18,6),
    volume      BIGINT,
    trade_date  DATE NOT NULL
) ON ps_trade_daily(trade_date);

-- Clustered composite: ticker first, then timestamp
CREATE CLUSTERED INDEX cidx_ticks_ticker_ts
    ON market_ticks (ticker ASC, trade_ts ASC);

-- Covering non-clustered index: OHLCV on the same key avoids key lookups
CREATE NONCLUSTERED INDEX nidx_ticks_ohlcv
    ON market_ticks (ticker, trade_ts)
    INCLUDE (open_price, high_price, low_price, close_price, volume)
    WITH (FILLFACTOR = 80);

2. Sliding Window — SMA, Volatility & VWAP

SMA-5, SMA-30, rolling standard deviation (volatility), VWAP, and Golden/Death Cross detection all in a single CTE chain — no application-layer computation needed.

WITH price_base AS (
    SELECT ticker, trade_ts, trade_date, close_price, volume,
           AVG(close_price) OVER (
               PARTITION BY ticker ORDER BY trade_ts
               ROWS BETWEEN 4 PRECEDING AND CURRENT ROW
           ) AS sma_5,
           AVG(close_price) OVER (
               PARTITION BY ticker ORDER BY trade_ts
               ROWS BETWEEN 29 PRECEDING AND CURRENT ROW
           ) AS sma_30,
           STDEVP(close_price) OVER (
               PARTITION BY ticker ORDER BY trade_ts
               ROWS BETWEEN 29 PRECEDING AND CURRENT ROW
           ) AS volatility_30,
           SUM(close_price * volume) OVER (
               PARTITION BY ticker, trade_date ORDER BY trade_ts
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           )
           / NULLIF(SUM(volume) OVER (
               PARTITION BY ticker, trade_date ORDER BY trade_ts
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW), 0
           ) AS vwap,
           (close_price - LAG(close_price, 1) OVER (PARTITION BY ticker
                                                    ORDER BY trade_ts))
           / NULLIF(LAG(close_price, 1) OVER (PARTITION BY ticker
                                              ORDER BY trade_ts), 0) AS pct_return
    FROM market_ticks
    WHERE trade_date >= DATEADD(day, -90, CAST(GETDATE() AS DATE))
),
signals AS (
    SELECT *,
           CASE
               WHEN sma_5 > sma_30
                AND LAG(sma_5)  OVER (PARTITION BY ticker ORDER BY trade_ts)
                 <= LAG(sma_30) OVER (PARTITION BY ticker ORDER BY trade_ts)
                   THEN 'GOLDEN_CROSS'
               WHEN sma_5 < sma_30
                AND LAG(sma_5)  OVER (PARTITION BY ticker ORDER BY trade_ts)
                 >= LAG(sma_30) OVER (PARTITION BY ticker ORDER BY trade_ts)
                   THEN 'DEATH_CROSS'
           END AS signal
    FROM price_base
)
SELECT ticker, trade_ts, close_price,
       ROUND(sma_5, 4)            AS sma_5,
       ROUND(sma_30, 4)           AS sma_30,
       ROUND(volatility_30, 6)    AS volatility,
       ROUND(vwap, 4)             AS vwap,
       ROUND(pct_return * 100, 4) AS pct_return,
       signal
FROM signals
WHERE signal IS NOT NULL
ORDER BY trade_ts DESC;

// Consulting Experience

Enterprise Engagements
Real Business Impact

Engagements are described at an industry level to respect client NDAs; sectors and outcomes are accurate.

Decision Inc.

Analytics Consulting Firm · Johannesburg

Retail / FMCG

Semantic Layer Redesign

Rebuilt a legacy SSAS cube into a modern star schema on Azure Synapse for a large retail client. Introduced SCD Type 2 across all slowly-changing dimensions, eliminating aggregation inconsistencies in monthly management reporting packs.

Power BI Performance Engineering

Applied partition-aligned materialized views and incremental refresh policies, reducing 30-minute daily refresh cycles to under 4 minutes — a 7.5× improvement by restricting refreshes to the most recently modified partition.
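The pruning logic behind that refresh policy boils down to comparing each partition's data watermark against its last refresh. A minimal sketch of the idea in Python (illustrative names, not the client's actual code):

```python
from datetime import date

def partitions_to_refresh(partition_watermarks, last_refreshed):
    """Return only partitions whose data changed since their last refresh.

    partition_watermarks: {partition_key: max modified date in that partition}
    last_refreshed:       {partition_key: date the partition was last refreshed}
    Partitions never seen before are always refreshed.
    """
    stale = []
    for key, modified in partition_watermarks.items():
        refreshed = last_refreshed.get(key)
        if refreshed is None or modified > refreshed:
            stale.append(key)
    return sorted(stale)

# Only 2023-12 changed since its last refresh, so only that slice reloads;
# the other monthly partitions are skipped entirely.
watermarks = {"2023-11": date(2023, 11, 30), "2023-12": date(2023, 12, 18)}
refreshed  = {"2023-11": date(2023, 12, 1),  "2023-12": date(2023, 12, 15)}
print(partitions_to_refresh(watermarks, refreshed))  # ['2023-12']
```

Refreshing one partition instead of twelve is where the order-of-magnitude cycle-time win comes from.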

ETL Governance Framework

Built row-level audit tables, CHECKSUM_AGG validation, and data lineage tagging across 14 ADF pipelines. Zero pipeline failures in the 3 months post-implementation.
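The audit idea can be sketched in Python: row counts plus an order-insensitive hash, compared across pipeline stages. This is a hand-rolled illustration in the spirit of T-SQL's CHECKSUM_AGG(CHECKSUM(*)), with illustrative names rather than the framework's actual code:

```python
import hashlib

def table_fingerprint(rows):
    """Order-insensitive fingerprint: row count + XOR-fold of per-row SHA-256."""
    agg = 0
    for row in rows:
        digest = hashlib.sha256("|".join(map(str, row)).encode()).digest()
        agg ^= int.from_bytes(digest[:8], "big")  # fold each row to 64 bits
    return {"row_count": len(rows), "row_hash": agg}

def audit(source_rows, target_rows):
    """Raise if source and target fingerprints diverge anywhere."""
    src, tgt = table_fingerprint(source_rows), table_fingerprint(target_rows)
    if src != tgt:
        raise RuntimeError(f"Audit FAILED: {src} != {tgt}")
    return "PASSED"

rows = [(1, "A", 100.0), (2, "B", 250.5)]
print(audit(rows, list(reversed(rows))))  # PASSED: row order doesn't matter
```

Because XOR is commutative, the fingerprint is stable under re-ordering, which matters when source and target return rows in different physical order.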

Azure Synapse · Power BI · ADF · SQL Server · SCD Type 2

PBT Group

Data & Analytics Specialists · SA

Insurance / FSI

Lakehouse Migration (Insurance)

Led the design of a Medallion Architecture migration for an insurance client, moving from on-prem Oracle to Azure Data Lake Gen2 + Databricks. Defined Bronze/Silver/Gold zone schemas and all ingestion patterns.

Claims Analytics Engineering

Implemented window-function risk scoring (RANK, LEAD/LAG) in Databricks SQL to support an actuarial team. Reduced time-to-insight from weekly batch reports to near-real-time dashboards.

Data Quality Framework

Built a Python profiling & validation suite (Great Expectations), enforcing schema contracts at the Silver layer boundary. Caught 23 upstream schema breaks before they propagated to Gold.
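The contract-at-the-Silver-boundary idea, sketched without the Great Expectations dependency (column names and rules here are illustrative, not the client's schema):

```python
# Illustrative Silver-layer schema contract: names, types, nullability.
SILVER_CLAIMS_CONTRACT = {
    "claim_id":      {"type": str,   "nullable": False},
    "policy_number": {"type": str,   "nullable": False},
    "claim_amount":  {"type": float, "nullable": True},
}

def validate_batch(records, contract):
    """Reject a batch that breaks the contract before it reaches Gold."""
    errors = []
    for i, rec in enumerate(records):
        missing = set(contract) - set(rec)
        if missing:
            errors.append(f"row {i}: missing columns {sorted(missing)}")
            continue
        for col, rule in contract.items():
            val = rec[col]
            if val is None:
                if not rule["nullable"]:
                    errors.append(f"row {i}: {col} is NULL")
            elif not isinstance(val, rule["type"]):
                errors.append(f"row {i}: {col} has type {type(val).__name__}")
    return errors

good = [{"claim_id": "C1", "policy_number": "P9", "claim_amount": 120.0}]
bad  = [{"claim_id": None, "policy_number": "P9", "claim_amount": "oops"}]
print(validate_batch(good, SILVER_CLAIMS_CONTRACT))  # []
print(validate_batch(bad, SILVER_CLAIMS_CONTRACT))
# ['row 0: claim_id is NULL', 'row 0: claim_amount has type str']
```

The point is where the check runs: at the Silver boundary, so an upstream break fails loudly instead of propagating to Gold.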

Databricks · Delta Lake · Python · Oracle → Azure · Great Expectations

// High-Demand Niche · South Africa

Legacy Modernisation Specialist

South African banking, retail, and insurance enterprises are sitting on decades of technical debt: Oracle environments, SSAS cubes, flat-file ETL, and fragile SSIS packages. There is a growing — and largely unmet — demand for engineers who can speak both the legacy language and the cloud future.

I bridge that gap: legacy stack fluency for safe migrations, cloud architecture knowledge for building something worth migrating to. That combination — legacy fluency + modern lakehouse architecture — is the niche.

// Let's Build

Ready to Architect
Your Data Platform?

Whether you need a lakehouse built from scratch, a legacy migration executed cleanly, or a senior engineer to close a technical gap — let's talk.

Based in Johannesburg, SA  ·  Available for remote & on-site contracts