What Is the Data Discipline?


The data discipline combines data engineering, data science, and data architecture. In practice, that means building scalable infrastructure for data movement, applying statistical modeling and machine learning, and using semantic structures such as knowledge graphs to capture complex relationships.

Guiding Principles

  • Business understanding drives technical decisions.

  • Data quality and governance are non-negotiable.

  • Efficient pipelines enable reliable analytics.

  • Models must be explainable and reproducible.

  • Complex relationships deserve graph structures.

  • Well-modeled data is agent memory. Structured knowledge enables autonomous systems to reason, not just retrieve.

  • Optimize for both performance and maintainability.

  • Build in privacy, safety, compliance, and security by default.

  • Data lineage is a trust requirement, not an audit trail. Know where every value comes from before acting on it.

  • Test on small datasets before scaling to large ones.

  • Each query, each transformation, and each machine has a cost.
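Several of these principles (documented transformations, lineage, and testing on small data first) can be illustrated together. The following is a minimal sketch, not a prescribed implementation: the `TrackedDataset` class, the `order_id` and `amount` fields, the file name, and the fill-with-zero policy are all assumptions made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class TrackedDataset:
    """Rows plus a log of every transformation applied (hypothetical helper)."""
    rows: list
    source: str
    lineage: list = field(default_factory=list)

    def apply(self, name, fn):
        """Run fn on the rows and return a new dataset with the step recorded."""
        new_rows = fn(self.rows)
        step = {"step": name, "rows_in": len(self.rows), "rows_out": len(new_rows)}
        return TrackedDataset(new_rows, self.source, self.lineage + [step])


def drop_duplicates(rows):
    """Keep the first occurrence of each order_id."""
    seen, out = set(), []
    for row in rows:
        if row["order_id"] not in seen:
            seen.add(row["order_id"])
            out.append(row)
    return out


def fill_missing_amount(rows):
    """Replace a missing amount with 0.0 (an illustrative policy only)."""
    return [
        {**row, "amount": 0.0 if row["amount"] is None else row["amount"]}
        for row in rows
    ]


# A small hand-made sample: check the logic here before running at scale.
sample = TrackedDataset(
    rows=[
        {"order_id": 1, "amount": 10.0},
        {"order_id": 1, "amount": 10.0},  # duplicate row
        {"order_id": 2, "amount": None},  # missing value
    ],
    source="orders_sample.csv",  # hypothetical file name
)

cleaned = sample.apply("drop_duplicates", drop_duplicates).apply(
    "fill_missing_amount", fill_missing_amount
)

# Each lineage entry records which step ran and how the row count changed.
for step in cleaned.lineage:
    print(step)
```

Because `apply` returns a new dataset instead of mutating the old one, the original sample stays available for comparison, and the lineage log travels with the cleaned rows.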

Integrated Data Workflow

  • Problem Definition & Architecture: Understand business objectives and design the data architecture (warehouse, lake, lakehouse). Establish KPIs and align stakeholder expectations.

  • Data Pipelines: Build robust ETL/ELT pipelines that ingest data from SQL databases, APIs, streaming sources, and files.

  • Data Quality & Governance: Clean and transform data at scale. Handle missing values, remove duplicates, and document transformations.

  • Exploratory Analysis & Data Modeling: Perform EDA, model data as knowledge graphs, and design ontologies for semantic queries. Consider retrieval patterns early: how data is structured determines how effectively agents and LLMs can reason over it.

  • Knowledge Graph & LLM Integration: Bridge structured knowledge and language models. Design ontologies and schemas that support grounded retrieval, use graph traversal to augment LLM context, and expose knowledge graph APIs as tools for agentic systems.

  • Feature Engineering & Storage: Optimize storage, create features, and leverage graph databases and vector stores as complementary structures: graphs for explicit relationships, vectors for semantic similarity and retrieval.

  • Modeling & Analytics: Train and tune models, then integrate them with knowledge graphs via embeddings.

  • Evaluation & Optimization: Assess performance, detect overfitting, optimize queries and pipelines.

  • Deployment & Monitoring: Deploy with CI/CD, build dashboards, monitor data quality, and retrain models as needed. For pipelines feeding agentic systems, extend monitoring to retrieval quality and semantic drift: the data may be fresh yet structurally misaligned with how agents consume it.
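The Knowledge Graph & LLM Integration step mentions using graph traversal to augment LLM context. One way this might look, as a rough sketch under stated assumptions: the graph is a toy list of triples, every entity and predicate name is invented, and `build_prompt` only assembles text (no model or graph database is called).

```python
from collections import deque

# Toy knowledge graph as (subject, predicate, object) triples; all names are made up.
TRIPLES = [
    ("Acme", "supplies", "WidgetCo"),
    ("WidgetCo", "manufactures", "Widget X"),
    ("Widget X", "uses_material", "Steel"),
    ("Acme", "located_in", "Berlin"),
]


def build_adjacency(triples):
    """Index outgoing edges by subject for fast traversal."""
    adjacency = {}
    for subj, pred, obj in triples:
        adjacency.setdefault(subj, []).append((pred, obj))
    return adjacency


def neighborhood_facts(adjacency, start, max_hops=2):
    """Breadth-first traversal collecting facts within max_hops of start."""
    facts, visited = [], {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue  # do not expand nodes at the hop limit
        for pred, obj in adjacency.get(node, []):
            facts.append(f"{node} {pred} {obj}")
            if obj not in visited:
                visited.add(obj)
                queue.append((obj, depth + 1))
    return facts


def build_prompt(question, facts):
    """Assemble a grounded prompt: retrieved facts first, then the question."""
    context = "\n".join(f"- {fact}" for fact in facts)
    return f"Known facts:\n{context}\n\nQuestion: {question}"


adjacency = build_adjacency(TRIPLES)
facts = neighborhood_facts(adjacency, "Acme", max_hops=2)
prompt = build_prompt("What does Acme's customer manufacture?", facts)
print(prompt)
```

The hop limit is the main design lever here: a larger `max_hops` gives the model more context but also more irrelevant facts, which is one reason retrieval quality is worth monitoring alongside data freshness.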