Exploring a Metadata-Driven Reporting Approach for OpenELIS (Prototype + Feedback Request)


@Moses_Mutesasira

Hi everyone,

I wanted to share a small prototype I’ve been working on for the OpenELIS reporting framework and get your feedback on whether this direction makes sense.

Over the past few weeks, I’ve been trying to understand how the current reporting system is structured and how reports are generated. Based on that, I explored a metadata-driven approach, where reports are defined as database records instead of Java classes.


Prototype Scope

  • Dynamic query construction using JPA Criteria API
  • Metadata-driven report definition:
    • datasets
    • fields
    • filters
    • columns
  • Basic separation between:
    • query definition
    • query execution
    • output rendering (CSV)

Supported Features

  • FIXED filters (applied silently)
  • PROMPT_USER filters (validated at runtime)
  • Basic whitelist/validation to keep query construction safe

The goal here was mainly to validate the core query-building approach and safety model, not to fully generalize everything yet.
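To make the safety model concrete, here is a minimal sketch of how a PROMPT_USER filter might be validated against a whitelist before it ever reaches query construction. The field and operator names are hypothetical placeholders, not the actual prototype code or OpenELIS schema:

```java
import java.util.Map;
import java.util.Set;

// Minimal sketch of whitelist validation for user-supplied filters.
// Field and operator names are illustrative, not real OpenELIS entities.
public class FilterValidator {

    // Only these fields may be filtered on, each with its allowed operators.
    private static final Map<String, Set<String>> ALLOWED = Map.of(
        "collectionDate", Set.of("EQ", "GTE", "LTE"),
        "status",         Set.of("EQ", "IN")
    );

    /** Rejects any filter targeting a non-whitelisted field or operator. */
    public static void validate(String field, String operator) {
        Set<String> ops = ALLOWED.get(field);
        if (ops == null) {
            throw new IllegalArgumentException("Field not allowed: " + field);
        }
        if (!ops.contains(operator)) {
            throw new IllegalArgumentException(
                "Operator " + operator + " not allowed for field " + field);
        }
    }
}
```

Because validation happens before query building, user input can only select from pre-approved query shapes rather than inject arbitrary criteria.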


Feedback I’m Looking For

  • Does this approach align with the future direction of the OpenELIS reporting framework?
  • Are there any suggestions on how I could better adapt this to fit upcoming plans or architectural goals?

I’d really appreciate any guidance or advice. I’m happy to learn from feedback and iterate on this further.


Context from Yesterday’s Meeting (16th April)

I found the suggestions around:

  • FHIR analytics / data pipelines
  • CQRS

really useful, and I’ve started exploring how this design could align with those ideas.

At a high level, I’m thinking of this as:

  • a potential read-model layer (in line with CQRS concepts)
  • something that could later integrate with pipeline-based analytics systems

But I’m still exploring this, so any guidance here would be really helpful.


Flow Explanation (Simple)

At a high level, the prototype follows this flow:

  1. A report is defined using metadata (a database record):

    • datasets
    • fields
    • filters
    • columns
  2. The system dynamically builds a query using the JPA Criteria API

    • FIXED filters are enforced automatically
    • PROMPT_USER filters are taken as input and validated
  3. The query is executed against the database

  4. Results are formatted into CSV output

The idea is to keep report logic outside of Java code and instead drive everything through structured metadata, making it easier to extend and modify reports without code changes.
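As a rough illustration of steps 1 and 4 (the metadata structures and CSV rendering; the Criteria API step in between is elided, and all class and field names here are hypothetical rather than the prototype's actual code):

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Illustrative metadata records for a report definition.
record ColumnDef(String field, String label) {}
record FilterDef(String field, String operator, String value, boolean fixed) {}
record ReportDef(String dataset, List<ColumnDef> columns, List<FilterDef> filters) {}

class CsvRenderer {
    /** Renders already-fetched rows (field -> value maps) as CSV text. */
    static String render(ReportDef def, List<Map<String, Object>> rows) {
        // Header comes from the column metadata, not from Java code.
        String header = def.columns().stream()
            .map(ColumnDef::label)
            .collect(Collectors.joining(","));
        // Each row is projected through the same column list, so column
        // order and selection are fully metadata-driven.
        String body = rows.stream()
            .map(row -> def.columns().stream()
                .map(c -> String.valueOf(row.getOrDefault(c.field(), "")))
                .collect(Collectors.joining(",")))
            .collect(Collectors.joining("\n"));
        return body.isEmpty() ? header : header + "\n" + body;
    }
}
```

Adding or reordering report columns then means editing the metadata record, with no change to the rendering code.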


Links


Note

This is a minimal prototype intended to demonstrate the core idea. Some parts (such as field path resolution and join handling) are intentionally hardcoded or simplified to validate the approach against the OpenELIS entity model.

Thanks for this approach — it really gives direction, @junaid. I have gone through it and it's genuinely encouraging; however, I have a few concerns.
Querying the live transactional database for heavy ad-hoc reports is generally discouraged. Specifically, using a correlated subquery for SampleHuman could result in N+1 execution bottlenecks when generating reports across tens of thousands of samples. How do you plan to mitigate the load this places on the live OLTP database?

It’s great to see how the engine handles standard column filtering. I noticed that aggregations (like COUNT/SUM) aren’t supported and that complex crosstabs are routed back to LEGACY_PIVOT. Since those are often the most time-consuming reports to maintain, did you run into limitations with the Criteria API for handling dynamic GROUP BY and pivoting? I’m wondering if piping this data into an external analytics or BI tool might be an easier way to handle those complex matrix reports without having to write custom Java builder logic.

I also really liked the FHIR discussion and the data pipelines topic from our last call. The long-term goal for OpenELIS has been to adopt FHIR as the standard data pipeline, which we are already implementing — how do you see this custom JPA query engine fitting in? If we want dynamic reporting and analytics, doesn’t it make more architectural sense to extract those FHIR bundles into a dedicated FHIR repository or analytics pipeline?

Hi Derrick,

Thank you for the thoughtful feedback — I’m really grateful that you took the time to stress-test the design. Your points around OLTP load, aggregation complexity, and the FHIR roadmap helped me re-evaluate the system from a more production-oriented perspective.

1. Managing OLTP Performance & Stability

You’re absolutely right about the risks of querying the live transactional database. In the current prototype, the goal was to validate the metadata-driven query engine, but I agree this approach needs stronger safeguards for production use.

Specifically regarding the correlated subquery and potential N+1 issues, I’m planning to refactor this into a join-based or batch-fetch strategy to avoid per-row execution overhead.

In addition, I see a layered mitigation approach:

  • Query-level safeguards: row limits, timeouts, and restricted query shapes
  • Execution strategy: avoiding correlated subqueries in favor of joins
  • Architectural direction: eventually routing reporting workloads to a read replica or asynchronous reporting store

Together, these layers should reduce both per-query inefficiency and the overall load placed on the OLTP system.
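To sketch what the batch-fetch strategy could look like (this is the general idea, not the prototype's code; `fetchByIds` stands in for a single IN-clause query against the related table, e.g. SampleHuman keyed by sample ID):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Sketch: instead of one related-record lookup per result row (N+1),
// collect the IDs, fetch related rows in fixed-size batches, and join
// them in memory. With batchSize=500, 50,000 samples cost ~100 queries
// instead of 50,000 correlated subquery executions.
class BatchFetcher {
    static <T> Map<Long, T> fetchInBatches(
            List<Long> ids,
            int batchSize,
            Function<List<Long>, Map<Long, T>> fetchByIds) { // one IN-clause query
        Map<Long, T> result = new HashMap<>();
        for (int i = 0; i < ids.size(); i += batchSize) {
            List<Long> batch = ids.subList(i, Math.min(i + batchSize, ids.size()));
            result.putAll(fetchByIds.apply(batch));
        }
        return result;
    }
}
```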

2. Scoping Aggregation & BI Complexity

You’re right that aggregations and cross-tab reports are some of the most complex and maintenance-heavy parts of reporting systems.

I did explore dynamic GROUP BY and pivoting using the Criteria API, but it quickly became clear that building a fully generic and maintainable abstraction for these within the GSoC scope would introduce significant complexity and long-term maintenance risk.

So the decision to defer them was intentional:

  • Focus on delivering a stable, metadata-driven filtering engine first
  • Avoid introducing fragile query-generation logic early

Your suggestion to integrate with external BI/analytics tools makes a lot of sense here. I see this as a cleaner separation of concerns:

  • Core engine → structured, operational reporting
  • External tools → advanced aggregations, pivots, and analytics

I’d definitely like to explore how we can design the system to support that kind of integration cleanly.

3. FHIR & Data Pipeline Alignment

I really liked the FHIR discussion as well — it helped clarify the long-term direction significantly.

I see this query engine acting as a transitional layer (“bridge”):

  • In the current state → operates on the relational schema for immediate reporting needs
  • In the future → the same abstraction can point to a FHIR-backed data source or analytics pipeline

To support this, I’m designing the engine with a decoupled data source layer, so that:

  • Query construction remains consistent
  • Only the underlying data provider changes (JPA → FHIR repository / analytics store)
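One way this decoupled data source layer could be sketched (all names are hypothetical; the point is only that the engine depends on an interface, not on JPA directly):

```java
import java.util.List;
import java.util.Map;

// Sketch of a pluggable data-provider abstraction: the report engine
// talks only to this interface, so a JPA-backed implementation could
// later be swapped for a FHIR-repository or analytics-store client
// without touching query definition or rendering.
interface ReportDataSource {
    /** Executes an already-validated report query and returns rows. */
    List<Map<String, Object>> fetch(String dataset,
                                    List<String> fields,
                                    Map<String, Object> filters);
}

// In-memory stand-in, useful for tests; a JpaDataSource or
// FhirDataSource would implement the same contract.
class InMemoryDataSource implements ReportDataSource {
    private final List<Map<String, Object>> rows;

    InMemoryDataSource(List<Map<String, Object>> rows) {
        this.rows = rows;
    }

    @Override
    public List<Map<String, Object>> fetch(String dataset,
                                           List<String> fields,
                                           Map<String, Object> filters) {
        // Naive equality filtering, standing in for real query execution.
        return rows.stream()
            .filter(r -> filters.entrySet().stream()
                .allMatch(f -> f.getValue().equals(r.get(f.getKey()))))
            .toList();
    }
}
```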

Long-term, I agree that extracting FHIR bundles into a dedicated analytics pipeline is the more scalable direction. My goal is to ensure that what we build now doesn’t become a dead-end, but instead evolves naturally into that architecture.

I’d be happy to iterate further on any of these areas, especially around query optimization or how this could align more closely with the FHIR pipeline design.