IN.IDEA: Performance and Scaling
[!IMPORTANT] Architectural Status: The infrastructure described in this document—specifically the Postgres-as-Single-Source-of-Truth (SSoT) and the automated ETL Pipeline — is currently a conceptual design. It serves as a roadmap for implementation and is not yet part of the active codebase.
1. Core Objectives
The IN.IDEA graph model employs a deep layered architecture to preserve epistemic nuance. While this is ideal for data integrity, it introduces high "hop-counts" for standard queries. This strategy defines how we denormalize specific paths during the transfer from the relational SSoT to the Graph DB to achieve:
- O(1) Hierarchy Lookups: Bypassing recursive traversals in the Tree of Concepts.
- Low-Latency Standard Views: Providing direct access to high-certainty data for front-end consumers.
- Reduced Computational Overhead: Shifting complex logic from query-time to ingest-time.
2. Hierarchical Flattening (Materialized Paths)
To avoid expensive recursive :IS_A traversals within the Concept hierarchy, we implemented Materialized Path Arrays.
Implementation: concept_path_ids
Each Concept node is enriched with a property containing its entire ancestral lineage.
- Property:
concept_path_ids, Type:LIST<INT>(production) /LIST<STRING>(fixtures). - Content: An array of all concept_ids in the path, including the node's own ID (e.g.,
["self-id", "parent-id", "root-id"]). - Technical Benefit: Enables high-performance filtering using Neo4j's array indexing.
Query Performance Analysis
[!NOTE] Benchmarking Disclaimer: The millisecond values recorded here reflect the current fixture set and are intended solely to show relative acceleration. In a production environment, the execution plan structure is the primary indicator of performance rather than these specific timings.
1. Traditional Query (Recursive Traversal)
This approach navigates the physical layers and then performs a recursive search up the taxonomic tree to identify objects.
// TRADITIONAL APPROACH: Use recursive IS_A relationships
MATCH (u:Unit)-[:HAS_COMPOSITION]->(:Composition)-[:HAS_COMPOSITION_ENTITY]->(ce:CompositionEntity)
MATCH (ce)-[:HAS_INTERPRETATION]->(i:Interpretation)-[:IDENTIFIED_AS_CONCEPT]->(baseConcept:Concept)
// UPWARDS NAVIGATION: Traverse the hierarchy upwards to find the target ancestor
MATCH (baseConcept)-[:IS_A*0..]->(target:Concept {concept_id: 'concept-object'})
// OUTPUT: Retrieve the desired identification details
RETURN
u.unit_id AS Unit,
ce.composition_entity_id AS CompositionEntity,
i.certainty AS IdentificationCertainty,
baseConcept.concept_id AS IdentifiedConcept;
Result Set
| Unit | CompositionEntity | IdentificationCertainty | IdentifiedConcept |
|---|---|---|---|
| "unit-apple-tomato" | "entity-at-apple-tomato" | 0.5 | "concept-tomato" |
| "unit-apple-tomato" | "entity-at-apple-tomato" | 0.4 | "concept-apple" |
| "unit-billiard-ball" | "entity-bb-billiard-ball-sphere" | 1.0 | "concept-billiard-ball-7" |
Full Execution Profile
| Operator | Id | Details | Rows | DB Hits |
|---|---|---|---|---|
| ProduceResults | 0 | Unit, CompositionEntity, Certainty, IdentifiedConcept | 3 | 0 |
| Projection | 1 | u.unit_id, ce.composition_entity_id, i.certainty, baseConcept.concept_id | 3 | 12 |
| Filter | 2 | u:Unit | 3 | 3 |
| Expand(All) | 3 | (anon_1)<-[:HAS_COMPOSITION]-(u) | 3 | 25 |
| Filter | 4 | anon_1:Composition | 3 | 3 |
| Expand(All) | 5 | (ce)<-[:HAS_COMPOSITION_ENTITY]-(anon_1) | 3 | 19 |
| Filter | 6 | ce:CompositionEntity | 3 | 3 |
| Expand(All) | 7 | (i)<-[:HAS_INTERPRETATION]-(ce) | 3 | 18 |
| Filter | 8 | i:Interpretation | 3 | 3 |
| Expand(All) | 9 | (baseConcept)<-[:IDENTIFIED_AS_CONCEPT]-(i) | 3 | 23 |
| Filter | 10 | baseConcept:Concept | 7 | 7 |
| VarLengthExpand(All) | 11 | (target)<-[:IS_A*0..]-(baseConcept) | 7 | 13 |
| NodeUniqueIndexSeek | 12 | UNIQUE target:Concept(concept_id) WHERE concept_id = 'concept-object' | 1 | 2 |
Total database accesses: 131 | Execution time: ~63 ms
Commentary
The critical bottleneck in this plan is the VarLengthExpand(All) operator (Id 11). This forces the database to perform a live, recursive search through the ontology for every potential result. While functional for small fixtures, this operation scales poorly; as the taxonomy grows deeper, query latency increases linearly with each additional level in the hierarchy.
2. Optimized Query (Materialized Path)
This approach leverages the Hierarchical Flattening strategy by checking a pre-calculated array of all ancestors stored on each concept node.
// OPTIMIZED APPROACH: Use the index to find all concepts belonging to 'concept-object'
MATCH (c:Concept)
WHERE 'concept-object' IN c.concept_path_ids
// BACKWARDS NAVIGATION: Move from the semantic layer back to the physical data
MATCH (c)<-[:IDENTIFIED_AS_CONCEPT]-(i:Interpretation)
MATCH (i)<-[:HAS_INTERPRETATION]-(ce:CompositionEntity)
MATCH (ce)<-[:HAS_COMPOSITION_ENTITY]-(comp:Composition)<-[:HAS_COMPOSITION]-(u:Unit)
// OUTPUT: Retrieve the desired identification details
RETURN
u.unit_id AS Unit,
ce.composition_entity_id AS CompositionEntity,
i.certainty AS IdentificationCertainty,
c.concept_id AS IdentifiedConcept;
Result Set
| Unit | CompositionEntity | IdentificationCertainty | IdentifiedConcept |
|---|---|---|---|
| "unit-apple-tomato" | "entity-at-apple-tomato" | 0.5 | "concept-tomato" |
| "unit-apple-tomato" | "entity-at-apple-tomato" | 0.4 | "concept-apple" |
| "unit-billiard-ball" | "entity-bb-billiard-ball-sphere" | 1.0 | "concept-billiard-ball-7" |
Full Execution Profile
| Operator | Id | Details | Rows | DB Hits |
|---|---|---|---|---|
| ProduceResults | 0 | Unit, CompositionEntity, Certainty, IdentifiedConcept | 3 | 0 |
| Projection | 1 | u.unit_id, ce.composition_entity_id, i.certainty, baseConcept.concept_id | 3 | 12 |
| Filter | 2 | 'concept-object' IN baseConcept.concept_path_ids AND baseConcept:Concept | 3 | 12 |
| Expand(All) | 3 | (i)-[:IDENTIFIED_AS_CONCEPT]->(baseConcept) | 9 | 50 |
| Filter | 4 | i:Interpretation | 9 | 9 |
| Expand(All) | 5 | (ce)-[:HAS_INTERPRETATION]->(i) | 9 | 36 |
| Filter | 6 | ce:CompositionEntity | 6 | 6 |
| Expand(All) | 7 | (anon_1)-[:HAS_COMPOSITION_ENTITY]->(ce) | 6 | 18 |
| Filter | 8 | (u:Unit AND anon_1:Composition) | 2 | 4 |
| DirectedRelationshipTypeScan | 9 | (u)-[:HAS_COMPOSITION]->(anon_1) | 2 | 3 |
Total database accesses: 150 | Execution time: ~12 ms
Commentary
The recursive expander has been replaced by a local Filter (Id 2). This constitutes a hierarchy lookup; the cost to verify if a concept is an "object" is now constant and independent of the taxonomy's depth.
Although the total DB hits are slightly higher (150 vs 131), this is a deliberate choice by the Cost-Based Optimizer (CBO) for small datasets. Because the fixture set is tiny, the CBO determines that a full relationship scan (Id 9) is more efficient than the overhead of an index seek. In production environments with > 100k of Unit nodes, the CBO would pivot to an Index Seek, ensuring high-speed access that scales horizontally.
3. The Epistemic Bypass (Shortcut Relations)
The Epistemic Layer uses Interpretation nodes to model uncertainty. For the majority of end-user queries, we can bypass this layer when the data meets specific "Gold Standard" criteria.
3.1 Entity-to-Concept Shortcut
We collapse the path from a formal element to its semantic identification.
- Source Node:
CompositionEntity. - Target Node:
Concept. - Relationship:
HAS_CERTAIN_PRIMARY_CONCEPT. - Materialization Trigger:
Interpretation.status == "Primary".Interpretation.certainty >= 0.8(Current threshold).
3.2 Composition-to-Reading Shortcut
We bypass layout-based grouping nodes (e.g., CompositionGroup) to allow direct text-based filtering on the composition level.
- Source Node:
Composition. - Target Node:
Reading. - Relationship:
HAS_CERTAIN_PRIMARY_READING. - Materialization Trigger:
Reading.status == "Primary".Reading.certainty >= 0.8(Current threshold).
4. Proposed ETL Workflow (Conceptual)
The materialization logic resides entirely within the ETL (Extract, Transform, Load) pipeline.
| Step | Action | Description |
|---|---|---|
| 1. Extract | Daily Snapshot | Pull latest state from Postgres SSoT. |
| 2. Transform | Path Calculation | Compute the concept_path_ids array using a recursive CTE in Postgres. |
| 3. Transform | Shortcut Logic | Evaluate status and certainty thresholds to determine shortcut eligibility. |
| 4. Load | Graph Rebuild | Wipe the Neo4j instance (or specific labels) and perform a high-speed UNWIND load. Use Blue-Green Deployment to avoid downtime |
5. Performance Monitoring
Post-materialization, the following indices must be maintained in Neo4j to ensure the effectiveness of this strategy:
- Lookup Index: On
Concept(concept_path_ids). - Constraint: Unique IDs for all materialized relationships to prevent duplication during incremental updates (if implemented).