Error Handling Approaches
A handler is a procedure that takes in a value or set of values, performs some process, and returns a value or set of values. As with any code, handlers can fail. They most often encounter an error while processing their inputs, usually because of bad data, because the receiving system is down, or because of a coding error in the handler. If the handler is put together properly, however, an error does not need to stop the workflow.
Branches for Errors
Handlers can be coded to return an error if one occurs during processing. The handler can return the raw error, or it can process the error internally and return one of a limited set of error values. Either way, this lets the developer handle errors in the workflow. For example, there can be a branch for the case where you tried to create a ticket for a user that did not exist. That branch could create the user and then retry the create, or it could assign a ticket to a team that manages user data. It all depends on what the correct process is for your organization. The benefit is that the workflow doesn't need to stop: errors can be handled just as successful results are, with their own branch(es) in the workflow.
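The branching idea can be sketched in Python. This is only an illustration and not the workflow engine's handler API: `HandlerResult`, `create_ticket`, `create_user`, the `user_not_found` error value, and the in-memory `KNOWN_USERS` set are all hypothetical names invented for the example.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical in-memory stand-in for the receiving system's user store.
KNOWN_USERS = {"alice"}


@dataclass
class HandlerResult:
    """What a handler returns: a value on success, or an error value."""
    value: Optional[str] = None
    error: Optional[str] = None


def create_ticket(user: str, summary: str) -> HandlerResult:
    """Hypothetical handler: returns a limited error value instead of raising."""
    if user not in KNOWN_USERS:
        return HandlerResult(error="user_not_found")
    return HandlerResult(value=f"ticket for {user}: {summary}")


def create_user(user: str) -> HandlerResult:
    """Hypothetical handler that adds the missing user."""
    KNOWN_USERS.add(user)
    return HandlerResult(value=user)


def run_workflow(user: str, summary: str) -> HandlerResult:
    """Workflow with a success path and an error branch."""
    result = create_ticket(user, summary)
    if result.error == "user_not_found":
        # Error branch: create the user, then retry the create once.
        create_user(user)
        result = create_ticket(user, summary)
    return result
```

Calling `run_workflow("bob", "reset password")` takes the error branch, creates the user, and then completes the ticket creation. An organization that prefers the other option would replace the error branch with a step that assigns a ticket to the team managing user data.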
Process for Externally Handling Errors
It is also possible to create a generic error processing routine that empowers a set of administrators to handle errors that are generated by the workflows. This works in a very similar manner to branching from handlers for errors, but is more all-encompassing. This is the process that is in place in Kinetic's kinops SaaS solution.
Here is how it works. Every handler is set up to return an error if one occurs during processing, and the error is returned raw (not processed further inside the handler). Each handler is then wrapped in a routine that takes and returns the same values as the handler. Within this routine, the handler is called, and if it returns an error, the routine calls the error handling routine. The output of that routine is then passed into another instance of the same handler routine, which makes the handler routine recursive. This allows a further error to be caught if the updates made during error handling didn't fix the problem or led to a different one.
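A minimal Python sketch of such a recursive wrapper follows. It is not the kinops implementation: the `wrapped` function, the `(result, error)` return convention, the two example functions, and the `max_depth` limit are all assumptions made for illustration.

```python
from typing import Callable, Dict, Optional, Tuple

Inputs = Dict[str, str]
# Assumed convention: a handler returns (result, error); error is None on success.
Handler = Callable[[Inputs], Tuple[Optional[str], Optional[str]]]
# The error routine receives the inputs and the raw error, and returns
# the (possibly corrected) inputs to try next.
ErrorRoutine = Callable[[Inputs, str], Inputs]


def wrapped(handler: Handler, on_error: ErrorRoutine, inputs: Inputs,
            max_depth: int = 5) -> Tuple[Optional[str], Optional[str]]:
    """Call the handler; on error, run the error routine and recurse."""
    result, error = handler(inputs)
    if error is None:
        return result, None
    if max_depth == 0:
        # Assumed safety limit so the sketch cannot recurse forever.
        return None, error
    new_inputs = on_error(inputs, error)
    # Recursing means a second error is caught the same way as the first.
    return wrapped(handler, on_error, new_inputs, max_depth - 1)


def send_email(inputs: Inputs) -> Tuple[Optional[str], Optional[str]]:
    """Hypothetical handler that fails when no address is supplied."""
    if not inputs.get("email"):
        return None, "missing email"
    return f"sent to {inputs['email']}", None


def fix_inputs(inputs: Inputs, error: str) -> Inputs:
    """Hypothetical error routine: stands in for a person correcting inputs."""
    return {**inputs, "email": "admin@example.com"}
```

Here `wrapped(send_email, fix_inputs, {})` fails once, passes through the error routine, and succeeds on the recursive call. In the process described above, the error routine would wait on a ticket rather than return immediately.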
Inside the error handling process, a ticket is created for a team. Creating a ticket allows the error to be handed off to people who do not have access to the workflow interface, and it lets those administrators pass the ticket to other teams if appropriate. The ticket includes the error and a series of choices: skip the handler, retry the node, update the inputs to the node and retry, or do nothing. Choosing to retry the node assumes that the person working the ticket has corrected the receiving system or the data, so the retry should succeed. Choosing to update the inputs and retry gives the person working the ticket more flexibility.
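The four ticket choices can be modeled as a small dispatch. The sketch below is hypothetical: the `Choice` enum, the `Outcome` type, its status strings, and `apply_choice` are illustrative names, not part of the product.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, Optional

Inputs = Dict[str, str]


class Choice(Enum):
    SKIP = "skip"
    RETRY = "retry"
    UPDATE_AND_RETRY = "update_and_retry"
    DO_NOTHING = "do_nothing"


@dataclass
class Outcome:
    status: str                  # "skipped", "completed", or "waiting"
    value: Optional[str] = None


def apply_choice(choice: Choice, handler: Callable[[Inputs], str],
                 inputs: Inputs,
                 updated_inputs: Optional[Inputs] = None) -> Outcome:
    """Apply the choice selected on the error ticket to the failed node."""
    if choice is Choice.SKIP:
        # Move on without running the handler.
        return Outcome("skipped")
    if choice is Choice.DO_NOTHING:
        # Leave the node as it is.
        return Outcome("waiting")
    if choice is Choice.UPDATE_AND_RETRY:
        if updated_inputs is None:
            raise ValueError("updated inputs are required for this choice")
        inputs = updated_inputs
    # RETRY and UPDATE_AND_RETRY both run the handler again.
    return Outcome("completed", handler(inputs))
```

A plain retry reuses the original inputs, on the assumption that the receiving system or data has been corrected, while update-and-retry substitutes the inputs supplied by the person working the ticket.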
This is a very powerful system. It helps prevent errors from getting lost in the error console of the workflow engine at companies that may have only one or two people with access to that console but many people and workflows in the system. However, it doesn't allow for the smart, automated error fixes described in the section on branches for errors. The two solutions can be blended as necessary and appropriate.
Process for Retrying Inside the Engine
Sometimes handlers haven't been set up to return errors, or the correct error process has not yet been set up. In those cases, the tools within the workflow engine itself need to be used to process and move past errors. The place to start is the Errors area, which shows a description of the error, a stack trace, and the available choices of what to do about the error. The error page indicates the run that experienced the error (with a link to it) and the parent run of that run (with a link to it). Handler errors allow you to skip, retry, or do nothing.
What to do about the error depends entirely on what the error is. If the error was that the remote system could not be reached and you know it is now up, you may just choose retry. If it was an issue with data, you may need to go to the run and update some inputs or results before you are able to retry. If you would be better off restarting the entire tree or routine in which the error occurred from the beginning, you may choose to do nothing and rerun the tree/routine instead.
Note that skipping a node is rarely appropriate. Most nodes perform important functions and/or are referenced later in the tree, so skipping them is often not workable. If, however, a node is not strictly needed for some reason, it can be skipped.