Wednesday, 4 March 2009

DWH Maintenance team: Be aware of your Points of Failure

I met a lot of situations where the end user is notifying the Data Warehouse administration team that certain processes did not run while it should be the maintenance team itself who should notice the error (and communicate to the business about the problem, including an estimated time to fix).

To understand which connections, flows, objects or systems can bring the entire data flow to a halt you should first determine every single point of failure within the whole architecture (like network availability, database availability, certain constraints in extraction scripts, file transfer scripts etc.). Although you should minimize the number of points of failure, the number of layers in a Data Warehouse environment results in several points of failure. As long as these are known, monitored and ´how-to-repair´ documentation is available and used, the change on a total halt for an unacceptable time is minimized.

As you can define almost an unlimited number of point of failures you should focus on system level (like free disk space), application level (like the availability of Mirror jobs) and data flow level (like File Transfer processes and error handling tables with non- loaded data).

Don’t understimate the number of point of failuers. In the graphical overview below a Data Warehouse environment is presented.
To be analyzed possible single points of failure are marked with a sign. Also the system itself which includes these point of failures can be defined as a point of failure.

After detecting these point of failures, you should determine whether and how these points of failures needs to be monitored.
Be aware that you, as an administrator, only migrate the ETL, copy and frontend scripts if the developer transfers fail-over documentation, so you know what to do in case of failures.

No comments:

Post a Comment

Thanks for your comment.