NXP Thesis Proposal
Observability
RQ1:
Research Problem:
When PA is queried to extract HPC usage data, gaps appear between the actual HPC usage and the registered HPC usage. The querying process itself widens these gaps, possibly because PA does not passively observe or save data while the query runs.
Research Question:
How can we accurately observe data pipelines and report correct HPC usage?
What’s Observability?
In a broad sense, observability is the degree to which one can understand the internal state or condition of a complex system based only on its external outputs. The more observable a system is, the faster and more accurately one can identify a performance problem and its root cause without additional testing or coding.
For this research problem, observability will be used for HPC usage monitoring.
Research Method (Possible Solution):
The solution lies in enhancing the observability of HPC clusters.
• Define new metrics and automated methods to detect when a gap exists.
• Examine the observability telemetry to determine how often a substantial gap occurs that needs to be addressed.
• Analyse the relationship between metrics and gaps so that a fix can be triggered before a gap becomes too large.
Based on these steps, we can build a tool that resolves the gap issue.
The primary goal is to passively observe the HPC cluster and collect the metrics that the tool needs. The tool then works through a queue: the captured metrics are queued for the tool, which verifies whether PA shows a gap compared with what the tool has observed.
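The queue-based comparison described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the tool itself: the `UsageSample` schema, the `find_gaps` helper, the core-hours unit, and the use of a plain dictionary in place of a real PA query are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class UsageSample:
    """One passively observed measurement (hypothetical schema)."""
    job_id: str
    core_hours: float


def find_gaps(observed, registered, tolerance=0.0):
    """Return {job_id: observed - registered} for gaps larger than tolerance.

    `observed` is a queue of UsageSample objects; `registered` maps job_id
    to the usage PA reports (a dict stands in for the real PA query).
    """
    totals = {}
    while observed:
        sample = observed.popleft()  # consume samples in arrival order
        totals[sample.job_id] = totals.get(sample.job_id, 0.0) + sample.core_hours

    gaps = {}
    for job_id, seen in totals.items():
        gap = seen - registered.get(job_id, 0.0)
        if abs(gap) > tolerance:
            gaps[job_id] = gap
    return gaps


queue = deque([
    UsageSample("job-1", 2.0),
    UsageSample("job-2", 1.5),
    UsageSample("job-1", 1.0),
])
pa_registered = {"job-1": 3.0, "job-2": 1.0}  # assumed PA values
gaps = find_gaps(queue, pa_registered, tolerance=0.1)
print(gaps)
```

In this example only `job-2` is reported, because its observed usage exceeds the registered value by more than the tolerance.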
Issue with the Proposed Solution:
Constantly triggering the database to drive the gap to zero means invoking the update repeatedly. Hence, we need to establish an acceptable gap size for the business case. The critical requirement is that no gap remains at fixed reporting boundaries such as the end of the day, week, or month.
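One way to express this trade-off is a small decision rule that triggers an update only when the gap exceeds the acceptable size, or when a reporting boundary is close and any gap remains. The sketch below is an assumption for illustration: the function name `should_sync`, the `max_gap` value, and the 30-minute `guard` window are hypothetical and would have to come from the business case.

```python
from datetime import datetime, timedelta


def should_sync(gap, now, next_boundary, max_gap=5.0, guard=timedelta(minutes=30)):
    """Decide whether to trigger an update of the registered usage.

    Sync when the gap exceeds the acceptable size, or when any gap remains
    and the next reporting boundary is closer than `guard`.
    """
    if abs(gap) > max_gap:
        return True
    near_boundary = timedelta(0) <= next_boundary - now <= guard
    return near_boundary and gap != 0


end_of_day = datetime(2024, 1, 1, 23, 59)
midday = datetime(2024, 1, 1, 12, 0)
late = datetime(2024, 1, 1, 23, 45)

print(should_sync(1.0, midday, end_of_day))  # small gap, far from boundary
print(should_sync(1.0, late, end_of_day))    # small gap, close to boundary
print(should_sync(8.0, midday, end_of_day))  # gap above the acceptable size
```

With these assumed values, the small midday gap is tolerated, while the same gap shortly before the end of the day triggers an update.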
RQ2:
Research Problem:
The primary challenge is the discrepancy between actual HPC usage and registered HPC usage. The solution to this problem should be general enough to address other gap problems that emerge from HPC observability.
Research Question:
Can the previous solution be generalized when applied to a different system?
Research Method:
• Analyse the system whose observability we want to enhance.
• Determine whether we can measure parameters that can be used to define system metrics.
• Verify whether these new metrics can be used with the previous solution.
This approach should allow us to implement a generalized logic and adapt it to the system's structure, so that a generic solution can be applied to a different system by customizing the metrics.
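The idea of a generic logic with system-specific metrics can be sketched as a gap checker that receives its metric definitions as parameters. Everything named here is hypothetical: `make_gap_checker`, the record field names, and the storage system used as the second example are assumptions chosen only to show the pattern.

```python
from typing import Callable, Dict

# A metric extractor turns a system-specific raw record into one number.
MetricFn = Callable[[dict], float]


def make_gap_checker(metrics: Dict[str, MetricFn], tolerance: float):
    """Build a checker that compares observed and registered records per metric."""
    def check(observed: dict, registered: dict) -> Dict[str, float]:
        gaps = {}
        for name, extract in metrics.items():
            gap = extract(observed) - extract(registered)
            if abs(gap) > tolerance:
                gaps[name] = gap
        return gaps
    return check


# HPC system: usage expressed in core-hours (hypothetical field names).
hpc_check = make_gap_checker(
    {"core_hours": lambda r: r["cores"] * r["hours"]}, tolerance=0.5
)
# A different system: storage expressed in gigabytes (hypothetical).
storage_check = make_gap_checker(
    {"gigabytes": lambda r: r["bytes"] / 1e9}, tolerance=0.1
)

hpc_gaps = hpc_check({"cores": 4, "hours": 3}, {"cores": 4, "hours": 2})
storage_gaps = storage_check({"bytes": 2e9}, {"bytes": 2e9})
print(hpc_gaps, storage_gaps)
```

The comparison logic stays the same for both systems; only the metric extractors and the tolerance change.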
RQ3:
Research Problem:
Resilience of the data pipeline and its infrastructure.
Research Question:
How can we anticipate measurable failures that can disrupt a job (e.g., pipeline crashes)?
Research Method & Proposed Solution:
Define the scenarios in which the pipeline may fail:
• Analyse previous pipeline issues and their reports.
• Identify corner cases.
RQ4:
Research Problem:
The data pipeline and its infrastructure have a backup system capable of recovering a failed data pipeline.
Research Question:
How can we verify that the backup system works effectively?
What’s chaos engineering?
Chaos engineering comprises a set of principles aimed at improving the resilience of a system. It involves building a hypothesis around the steady-state behavior of the system, varying real-world events, running experiments in production, automating experiments to run continuously, and minimizing the blast radius of the experiments.
Research Method (Possible Solution):
Chaos experiments can be used to check the state of the backups.
The final aim of the proposed approach is to increase the resilience of the backup system, so that when a pipeline fails, it can be recovered efficiently without restarting the pipeline from the beginning.
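A chaos experiment of this kind can be illustrated with a toy simulation: a failure is injected at a chosen step, the pipeline is resumed, and the steady-state hypothesis (all steps complete, none is repeated) is checked. This is a sketch under assumptions only. The `CheckpointedPipeline` class, the step names, and the in-memory checkpoint are hypothetical stand-ins for the real pipeline and backup system, and a real experiment would use a dedicated chaos tool rather than a simulation.

```python
class CheckpointedPipeline:
    """Toy pipeline that records a checkpoint after each completed step."""

    def __init__(self, steps):
        self.steps = steps
        self.checkpoint = 0   # index of the next step to run
        self.executed = []    # every step execution, including reruns

    def run(self, fail_at=None):
        """Run from the last checkpoint; raise at step index `fail_at` if set."""
        for i in range(self.checkpoint, len(self.steps)):
            if i == fail_at:
                raise RuntimeError(f"injected failure at step {self.steps[i]}")
            self.executed.append(self.steps[i])
            self.checkpoint = i + 1


def chaos_experiment(steps, fail_at):
    """Inject one failure, recover, and check the steady-state hypothesis."""
    pipeline = CheckpointedPipeline(steps)
    try:
        pipeline.run(fail_at=fail_at)
    except RuntimeError:
        pass  # expected: this is the injected fault
    pipeline.run()  # recovery resumes from the checkpoint
    completed = pipeline.checkpoint == len(steps)
    no_rework = pipeline.executed == steps  # each step ran exactly once
    return completed and no_rework


result = chaos_experiment(["extract", "transform", "load"], fail_at=1)
print(result)
```

If recovery restarted from the beginning, `executed` would contain repeated steps and the hypothesis check would fail, which is the signal the experiment is designed to surface.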
References:
https://www.ibm.com/topics/chaos-engineering
https://www.gremlin.com/community/tutorials/chaos-engineering-tools-comparison/