Advanced error and result modeling in modern .NET systems
When systems become real, failure stops being a side concern.
In simple demo code, failure often looks like one thing: “something went wrong.” In production systems, especially WPF desktop systems connected to machines, cameras, PLCs, storage, and long-running workflows, that is not good enough. Some failures are expected. Some are exceptional. Some should stop everything immediately. Some should only produce a warning. Some should be shown to the operator in simple language. Some should only be logged for engineers. Some mean “retry later.” Some mean “your workflow logic is wrong.” These are very different things, and a good system should model them differently.
That is why advanced error handling is really about failure modeling, not only try/catch.
1. Big picture — why failure modeling matters as much as failure handling
A lot of engineers think about failure only at the point where code breaks. They think in terms of catch blocks, logs, message boxes, and retries. But the deeper issue is earlier than that: how do we represent failure in the first place?
That design choice shapes the whole system.
If every kind of failure becomes an exception, then callers cannot easily tell the difference between:
- “recipe validation failed because the user entered an invalid threshold”
- “machine rejected the command because it is not homed yet”
- “camera disconnected unexpectedly”
- “image save partially failed but inspection still completed”
- “developer bug: impossible state reached”
These are not the same. They should not look the same in code. They should not be handled the same way in the UI. They should not be logged with the same severity. They should not trigger the same operational response.
In production systems, callers need a clear contract:
- What can fail?
- Is that failure expected or unexpected?
- Is the caller supposed to handle it explicitly?
- Can the workflow continue?
- Should the operator intervene?
- Should the system retry automatically?
- Is the failure safe, unsafe, temporary, or fatal?
That is why failure modeling matters as much as failure handling. If the contract is vague, the code becomes vague. And vague code around failure becomes operational pain.
Real examples
A machine command execution service might expose this:
Task<bool> StartInspectionAsync(CancellationToken ct);
This tells the caller almost nothing.
Did false mean:
- machine is disconnected?
- machine is busy?
- recipe is invalid?
- safety interlock is active?
- timeout?
- SDK threw?
- operation cancelled?
- command rejected because state is wrong?
The caller now has to guess, or depend on side channels like logs, out parameters, global state, or message events. That is weak design.
A better API tells the truth:
Task<Result<StartInspectionOutcome, MachineCommandError>> StartInspectionAsync(
InspectionRecipe recipe,
CancellationToken ct);
Now the contract says: this operation has a success outcome, and expected command-related failures are modeled explicitly. That makes the system more honest.
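The two-parameter Result<TValue, TError> in that signature is not a .NET base class library type. The block below is a minimal sketch of what such a type could look like; the names Ok, Fail, Parse, and Describe are assumptions made for illustration, not an established API.

```csharp
#nullable enable
using System;

// The caller has to branch on success or failure; there is no hidden side channel.
Console.WriteLine(Describe(Parse("42")));   // Value: 42
Console.WriteLine(Describe(Parse("abc")));  // Error: 'abc' is not a number.

static Result<int, string> Parse(string text) =>
    int.TryParse(text, out var value)
        ? Result<int, string>.Ok(value)
        : Result<int, string>.Fail($"'{text}' is not a number.");

static string Describe(Result<int, string> result) =>
    result.IsSuccess ? $"Value: {result.Value}" : $"Error: {result.Error}";

// Hypothetical result type with a typed success value and a typed error.
public sealed class Result<TValue, TError>
{
    private readonly TValue? _value;
    private readonly TError? _error;

    private Result(bool isSuccess, TValue? value, TError? error)
    {
        IsSuccess = isSuccess;
        _value = value;
        _error = error;
    }

    public bool IsSuccess { get; }
    public bool IsFailure => !IsSuccess;

    // Reading the wrong side is a programming error, so it throws.
    public TValue Value => IsSuccess
        ? _value!
        : throw new InvalidOperationException("No value: the result is a failure.");

    public TError Error => IsFailure
        ? _error!
        : throw new InvalidOperationException("No error: the result is a success.");

    public static Result<TValue, TError> Ok(TValue value) => new(true, value, default);
    public static Result<TValue, TError> Fail(TError error) => new(false, default, error);
}
```

Making the type a sealed class rather than a struct is a deliberate choice in this sketch: a default-initialized struct would look like a failure with no error attached.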
2. Different kinds of failure
One of the biggest design improvements a senior engineer brings is the ability to distinguish failure types instead of flattening them.
Unexpected technical failures
These are failures the normal caller is not expected to handle as part of regular business flow.
Examples:
- null reference because of a bug
- corrupted internal state
- invalid cast
- race condition causing impossible state
- driver crash
- unmanaged SDK access violation wrapped by adapter boundary
- disk subsystem throwing unexpected IO exception outside known behavior
These are usually exception territory.
They represent either:
- a programming error
- an infrastructure fault outside the normal business contract
- a system integrity issue
Expected business or domain failures
These are failures that are normal outcomes in the domain.
Examples:
- recipe is invalid
- machine is in wrong state for command
- command rejected because door is open
- inspection cannot start because wafer is not loaded
- workflow step cannot proceed because preconditions are not satisfied
These are often better modeled as results, not exceptions.
They are not “surprising system breakage.” They are expected possibilities.
Validation errors
Validation errors deserve their own category because they are usually not runtime faults at all. They are input-quality problems.
Examples:
- threshold is out of allowed range
- required calibration file missing from recipe
- scan area overlaps invalid region
- operator entered negative exposure time
- recipe references camera profile that does not exist
Validation is often multi-error by nature. A good validation model can return all issues together rather than failing on the first one.
Recoverable vs unrecoverable failures
This distinction matters operationally.
Recoverable:
- transient network issue
- temporary file lock
- telemetry stream dropped but can reconnect
- command timeout where retry is safe
- image thumbnail save failed, core image save succeeded
Unrecoverable:
- safety state violation
- machine axis control unavailable
- corrupted results package
- invariant broken inside workflow engine
- persistent camera initialization failure required for run correctness
The design question is not only “did it fail?” but also “what is the safe next action?”
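One way to make "what is the safe next action?" explicit is to give it its own type. The following is a small sketch under assumptions: FailureKind, NextAction, NextActionPolicy, and the retry limit of two are all hypothetical, and a real system would derive them from its own machine and safety rules.

```csharp
using System;

Console.WriteLine(NextActionPolicy.Decide(FailureKind.TransientIo, attempt: 1));     // RetryAutomatically
Console.WriteLine(NextActionPolicy.Decide(FailureKind.TransientIo, attempt: 3));     // AskOperator
Console.WriteLine(NextActionPolicy.Decide(FailureKind.SafetyViolation, attempt: 1)); // StopRunSafely

// Hypothetical classification of failures by what they mean operationally.
public enum FailureKind
{
    TransientIo,
    OptionalArtifactFailed,
    CommandTimeout,
    SafetyViolation,
    InvariantBroken
}

public enum NextAction
{
    RetryAutomatically,
    ContinueWithWarning,
    AskOperator,
    StopRunSafely
}

public static class NextActionPolicy
{
    // Assumed limit; tune per operation.
    private const int MaxAutomaticRetries = 2;

    public static NextAction Decide(FailureKind kind, int attempt) => kind switch
    {
        // Transient faults are retried a bounded number of times, then handed to a person.
        FailureKind.TransientIo when attempt <= MaxAutomaticRetries => NextAction.RetryAutomatically,
        FailureKind.TransientIo => NextAction.AskOperator,
        FailureKind.OptionalArtifactFailed => NextAction.ContinueWithWarning,
        FailureKind.CommandTimeout => NextAction.AskOperator,
        // Anything touching safety or integrity stops the run; unknown kinds do too.
        _ => NextAction.StopRunSafely
    };
}
```

The default arm is intentionally the most conservative action, so a newly added failure kind stops the run until someone decides otherwise.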
Warnings vs hard failures
Warnings mean the operation is still considered complete enough to proceed, but something important should be recorded or shown.
Examples:
- inspection completed but some thumbnails were not generated
- data archived locally but cloud upload deferred
- optional telemetry unavailable
- recipe used fallback calibration
Warnings are very important in industrial systems because many workflows should continue in degraded mode instead of failing completely.
Partial success
This is one of the most under-modeled areas.
Examples:
- 98 images saved, 2 thumbnails failed
- inspection completed, but one auxiliary statistics export failed
- report generated, but one optional annotation layer missing
- wafer scanned, but one region had low-confidence classification and was flagged
If your system models everything as either success or failure, you lose important reality. Production systems often produce mixed outcomes.
Why treating all of these the same leads to poor design
If all of them become exceptions:
- business logic gets buried in catch blocks
- logs become noisy
- expected operator actions look like crashes
- retries become inconsistent
- UI messages become technical
- workflows become fragile
If all of them become bool:
- callers lose meaning
- support cannot diagnose quickly
- monitoring becomes weak
- partial success disappears
- warnings get lost
- unsafe conditions may be ignored
The whole point of good failure modeling is to preserve meaning.
3. Real problems in a WPF desktop app controlling a wafer inspection machine
This kind of system is exactly where weak failure contracts cause chaos.
Imagine a WPF application controlling a wafer inspection machine. It coordinates:
- operator UI
- recipe loading and validation
- machine state
- motion control commands
- camera/image acquisition
- processing pipeline
- result storage
- alarms/logging
- long-running workflows
Now look at what can go wrong.
Machine command may fail for very different reasons
A StartInspectionAsync operation may fail because:
- machine is disconnected
- machine is not initialized
- machine is already running
- machine is in alarm state
- safety door is open
- vendor SDK timed out
- SDK threw native exception
- cancellation requested
- command acknowledged but completion event never arrived
These have different meanings.
The UI should not show the same message for all of them.
The workflow should not respond the same way either. “Machine not homed” is an operator-correctable domain issue. “Access violation in vendor camera SDK” is an engineering-level technical fault.
Workflow may partially complete
A run may successfully inspect wafers but fail to:
- save some preview images
- upload telemetry
- archive some raw debug traces
- enrich secondary analytics
- write optional audit attachment
That should not necessarily invalidate the whole run.
If your workflow engine only supports “success” or “throw,” it becomes too brittle.
UI needs operator-friendly messages while logs retain technical details
The operator needs something like:
Camera not available. Check connection and power, then retry.
The log needs something like:
VendorCameraException HRESULT=0x8007001F during InitializeCamera on CameraAdapter.InitializeAsync. Serial=CAM-07. NativeCode=DeviceBusy. DriverVersion=5.2.13.
Those are different views of the same incident. Good error modeling supports both.
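A single incident object can carry both views. This is only a sketch: the Incident record and its ToOperatorText and ToLogText methods are hypothetical names, and the sample values are taken from the two messages above.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

var incident = new Incident(
    Code: "Camera.Unavailable",
    OperatorMessage: "Camera not available. Check connection and power, then retry.",
    TechnicalMessage: "InitializeCamera failed in CameraAdapter.InitializeAsync.",
    Context: new Dictionary<string, object?>
    {
        ["Serial"] = "CAM-07",
        ["NativeCode"] = "DeviceBusy"
    });

Console.WriteLine(incident.ToOperatorText()); // goes to the UI
Console.WriteLine(incident.ToLogText());      // goes to the log

// Hypothetical incident type: one failure, two audiences.
public sealed record Incident(
    string Code,
    string OperatorMessage,
    string TechnicalMessage,
    IReadOnlyDictionary<string, object?> Context)
{
    // Plain language only; no codes, no stack locations.
    public string ToOperatorText() => OperatorMessage;

    // Stable code, technical message, and every context pair as key=value.
    public string ToLogText()
    {
        var context = string.Join(" ", Context.Select(pair => $"{pair.Key}={pair.Value}"));
        return $"[{Code}] {TechnicalMessage} {context}";
    }
}
```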
Data pipeline may continue with degraded behavior
Suppose thumbnail generation fails because the GPU helper process crashes. The main inspection results are still valid and the machine can keep running. The workflow should continue, record a warning, and surface that the result package is degraded.
This is much better than either:
- crashing the whole run, or
- silently hiding the issue
Some failures should stop the run immediately
Examples:
- stage lost synchronization
- axis controller reported unsafe motion state
- core result data cannot be persisted
- recipe integrity invalidates measurement correctness
- emergency stop triggered
These are fail-fast conditions.
Others should only mark warnings
Examples:
- preview generation failed
- optional telemetry unavailable
- thumbnail save retried and still failed
- background diagnostics export failed
These should not automatically stop production.
This is why failure contracts must be explicit. In this domain, failure is part of the workflow model.
4. Exceptions vs result-style modeling
This is where teams often become dogmatic.
Some teams say: “exceptions are bad; use Result everywhere.” Others say: “C# already has exceptions; use them for everything.”
Both extremes are usually wrong.
When exceptions are the right tool
Exceptions are the right tool when something happened that the caller is not expected to model as a routine branch.
Good examples:
- programming bugs
- invariant violations
- impossible states
- unexpected SDK crashes
- serialization bug
- null where contract guaranteed non-null
- infrastructure fault outside the normal operation contract
Example:
public async Task<InspectionPlan> BuildPlanAsync(Recipe recipe, CancellationToken ct)
{
    if (recipe is null) throw new ArgumentNullException(nameof(recipe));
    var machineConfig = await _configProvider.GetCurrentAsync(ct);
    if (machineConfig.AxisCount <= 0)
        throw new InvalidOperationException("Machine configuration is invalid.");
    // ...
}
This is fine. These are not “expected operator outcomes.”
When a Result-style return model is better
A result model is better when the caller should explicitly handle a known, normal possibility.
Examples:
- validation failure
- command rejection because state is wrong
- business rule not satisfied
- workflow step skipped
- partial completion with warnings
- optional operation failed but system can continue
Example:
public Task<Result<Unit, RecipeValidationError[]>> ValidateAsync(
InspectionRecipe recipe,
CancellationToken ct);
That tells the caller: validation issues are expected, and you should handle them explicitly.
Why expected failures are often better as return values
Because they are part of the contract.
If the machine can validly reject a command because it is not in the correct state, that is not exceptional. It is a normal branch. Modeling it as an exception often pushes domain logic into technical control flow.
Example:
var result = await _machineService.StartInspectionAsync(recipe, ct);
if (result.IsFailure)
{
    // Error is a MachineCommandError; its ErrorCode property carries the enum value.
    switch (result.Error.ErrorCode)
    {
        case MachineCommandErrorCode.InvalidState:
            ShowOperatorMessage(result.Error.OperatorMessage);
            return;
        case MachineCommandErrorCode.MachineDisconnected:
            ShowReconnectPrompt();
            return;
        default:
            // escalate or fallback
            break;
    }
}
This is clearer than catching five different custom exceptions for normal machine rejection cases.
Why unexpected failures are often better as exceptions
Because they should travel fast, preserve stack information, and signal abnormal conditions clearly.
If the internal workflow engine reaches an impossible state, returning Result.Fail("unexpected") often hides the severity.
Example:
if (!_stateMachine.CanTransition(currentState, trigger))
    throw new InvalidOperationException(
        $"Invalid workflow transition from {currentState} using {trigger}.");
That is a system defect, not an expected domain outcome.
Trade-offs
Exceptions:
- good for abnormal faults
- preserve stack traces
- integrate naturally with async/await
- bad when overused for routine outcomes
- can make expected failures invisible in signatures
Results:
- make expected failure explicit
- improve contract clarity
- good for validation, domain rules, partial success
- can become verbose
- can lead to “plumbing fatigue” if overused everywhere
Experienced engineers do not choose one universal rule. They choose based on whether the failure is part of normal operation.
5. Result pattern in practice
A result pattern is simply a structured way to return outcome information without using exceptions for normal branches.
That sounds simple, but the important part is what you put into the result model.
A weak result model:
public sealed class Result
{
    public bool Success { get; init; }
    public string? Error { get; init; }
}
This is not enough for production systems.
A stronger model usually needs:
- success/failure state
- error code/category
- operator-friendly message
- technical details or metadata
- warning collection
- partial success support
- maybe a typed success payload
Here is a realistic base model.
public enum ErrorCategory
{
Validation,
Domain,
Technical,
Timeout,
Connectivity,
Safety,
Concurrency,
Unexpected
}
public sealed record ErrorDetail(
string Code,
string Message,
ErrorCategory Category,
string? OperatorMessage = null,
IReadOnlyDictionary<string, object?>? Metadata = null);
public sealed class Result
{
private Result(bool isSuccess, IReadOnlyList<ErrorDetail> errors, IReadOnlyList<ErrorDetail> warnings)
{
IsSuccess = isSuccess;
Errors = errors;
Warnings = warnings;
}
public bool IsSuccess { get; }
public bool IsFailure => !IsSuccess;
public IReadOnlyList<ErrorDetail> Errors { get; }
public IReadOnlyList<ErrorDetail> Warnings { get; }
public static Result Success(params ErrorDetail[] warnings) =>
new(true, Array.Empty<ErrorDetail>(), warnings);
public static Result Failure(params ErrorDetail[] errors) =>
new(false, errors, Array.Empty<ErrorDetail>());
}
public sealed class Result<T>
{
private Result(bool isSuccess, T? value, IReadOnlyList<ErrorDetail> errors, IReadOnlyList<ErrorDetail> warnings)
{
IsSuccess = isSuccess;
Value = value;
Errors = errors;
Warnings = warnings;
}
public bool IsSuccess { get; }
public bool IsFailure => !IsSuccess;
public T? Value { get; }
public IReadOnlyList<ErrorDetail> Errors { get; }
public IReadOnlyList<ErrorDetail> Warnings { get; }
public static Result<T> Success(T value, params ErrorDetail[] warnings) =>
new(true, value, Array.Empty<ErrorDetail>(), warnings);
public static Result<T> Failure(params ErrorDetail[] errors) =>
new(false, default, errors, Array.Empty<ErrorDetail>());
}
That is still compact, but much more usable.
Example: ValidationResult
Validation often returns multiple problems.
public sealed record ValidationIssue(
string Code,
string Field,
string Message,
string? SuggestedFix = null);
public sealed class ValidationResult
{
private ValidationResult(IReadOnlyList<ValidationIssue> issues)
{
Issues = issues;
}
public IReadOnlyList<ValidationIssue> Issues { get; }
public bool IsValid => Issues.Count == 0;
public static ValidationResult Valid() => new(Array.Empty<ValidationIssue>());
public static ValidationResult Invalid(params ValidationIssue[] issues) => new(issues);
}
Usage:
public ValidationResult ValidateRecipe(InspectionRecipe recipe)
{
var issues = new List<ValidationIssue>();
if (recipe.ExposureTimeMs <= 0)
issues.Add(new("Recipe.Exposure.Invalid", "ExposureTimeMs", "Exposure time must be greater than zero."));
if (string.IsNullOrWhiteSpace(recipe.CameraProfile))
issues.Add(new("Recipe.CameraProfile.Missing", "CameraProfile", "Camera profile is required."));
if (recipe.ScanRegions.Count == 0)
issues.Add(new("Recipe.ScanRegions.Empty", "ScanRegions", "At least one scan region is required."));
return issues.Count == 0
? ValidationResult.Valid()
: ValidationResult.Invalid(issues.ToArray());
}
Example: StartInspectionResult
Sometimes a dedicated outcome type is even clearer than a generic result.
public enum StartInspectionStatus
{
Started,
Rejected,
Warning
}
public sealed record StartInspectionOutcome(
StartInspectionStatus Status,
string? RunId,
IReadOnlyList<ErrorDetail> Warnings);
public sealed record MachineCommandError(
string Code,
string Message,
string OperatorMessage,
bool Retryable,
bool SafeToRetry,
MachineCommandErrorCode ErrorCode);
public enum MachineCommandErrorCode
{
InvalidState,
MachineDisconnected,
Timeout,
AlarmActive,
SafetyInterlock,
RecipeInvalid
}
Service contract:
Task<Result<StartInspectionOutcome>> StartInspectionAsync(
InspectionRecipe recipe,
CancellationToken ct);
Implementation sketch:
public async Task<Result<StartInspectionOutcome>> StartInspectionAsync(
InspectionRecipe recipe,
CancellationToken ct)
{
var validation = _recipeValidator.ValidateRecipe(recipe);
if (!validation.IsValid)
{
return Result<StartInspectionOutcome>.Failure(
new ErrorDetail(
"Recipe.Invalid",
"Recipe validation failed.",
ErrorCategory.Validation,
"Recipe is invalid. Review highlighted fields."));
}
if (!_machineState.CanStartInspection)
{
return Result<StartInspectionOutcome>.Failure(
new ErrorDetail(
"Machine.InvalidState",
$"Machine state '{_machineState.Current}' does not allow StartInspection.",
ErrorCategory.Domain,
"Machine is not ready to start inspection.",
new Dictionary<string, object?> { ["MachineState"] = _machineState.Current }));
}
try
{
var runId = await _machineAdapter.StartInspectionAsync(recipe, ct);
return Result<StartInspectionOutcome>.Success(
new StartInspectionOutcome(StartInspectionStatus.Started, runId, Array.Empty<ErrorDetail>()));
}
catch (OperationCanceledException)
{
throw;
}
catch (TimeoutException ex)
{
return Result<StartInspectionOutcome>.Failure(
new ErrorDetail(
"Machine.Start.Timeout",
ex.Message,
ErrorCategory.Timeout,
"Machine did not respond in time. Retry after checking connection."));
}
}
Example: SaveImageResult with warnings and partial success
public sealed record SaveImageOutcome(
string ImageId,
string MainPath,
string? ThumbnailPath,
bool ThumbnailSaved);
public async Task<Result<SaveImageOutcome>> SaveImageAsync(
CapturedImage image,
CancellationToken ct)
{
var warnings = new List<ErrorDetail>();
string mainPath;
try
{
mainPath = await _imageStore.SaveMainImageAsync(image, ct);
}
catch (IOException ex)
{
return Result<SaveImageOutcome>.Failure(
new ErrorDetail(
"ImageSave.Main.Failed",
ex.Message,
ErrorCategory.Technical,
"Failed to save image data.",
new Dictionary<string, object?> { ["ImageId"] = image.Id }));
}
string? thumbnailPath = null;
bool thumbnailSaved = false;
try
{
thumbnailPath = await _imageStore.SaveThumbnailAsync(image, ct);
thumbnailSaved = true;
}
catch (Exception ex) when (ex is not OperationCanceledException) // cancellation must propagate, not become a warning
{
warnings.Add(new ErrorDetail(
"ImageSave.Thumbnail.Failed",
ex.Message,
ErrorCategory.Technical,
"Preview thumbnail could not be generated.",
new Dictionary<string, object?> { ["ImageId"] = image.Id }));
}
return Result<SaveImageOutcome>.Success(
new SaveImageOutcome(image.Id, mainPath, thumbnailPath, thumbnailSaved),
warnings.ToArray());
}
This is a realistic production pattern: core operation succeeded, but some secondary work degraded.
Example: WorkflowStepResult
public enum WorkflowStepStatus
{
Completed,
Skipped,
Failed,
CompletedWithWarnings
}
public sealed record WorkflowStepResult(
string StepName,
WorkflowStepStatus Status,
IReadOnlyList<ErrorDetail> Errors,
IReadOnlyList<ErrorDetail> Warnings,
TimeSpan Duration);
This is much better than bool ExecuteStep() for workflow orchestration.
6. Domain errors vs technical errors
This separation is crucial.
A machine operator does not care about a native HRESULT or driver stack location. A support engineer does. A developer cares even more.
If you mix these levels, you get one of two bad outcomes:
- users see meaningless technical messages
- logs lose the technical context needed for diagnosis
Example: vendor SDK throws native exception
Suppose the camera SDK throws this:
VendorCameraException: DeviceOpen failed. Error 0x889A0001. NodeMap unavailable.
That should not leak directly to the operator UI.
At the machine adapter boundary, translate it into a domain-relevant or application-relevant fault.
public async Task<Result<CameraSession>> OpenCameraAsync(CancellationToken ct)
{
try
{
var handle = await _sdk.OpenAsync(ct);
return Result<CameraSession>.Success(new CameraSession(handle));
}
catch (VendorCameraException ex) when (ex.Code == VendorCameraErrorCodes.DeviceBusy)
{
_logger.LogWarning(ex,
"Camera open failed because device is busy. CameraId={CameraId}", _cameraId);
return Result<CameraSession>.Failure(
new ErrorDetail(
"Camera.Unavailable",
ex.Message,
ErrorCategory.Connectivity,
"Camera is not available. Check connection and whether another process is using it.",
new Dictionary<string, object?>
{
["CameraId"] = _cameraId,
["VendorCode"] = ex.Code
}));
}
}
Why this separation matters
Because different layers need different language.
Infrastructure layer:
- precise technical details
- exception types
- error codes from external dependency
Application layer:
- meaningful categories
- retryability
- operator-safe wording
- business impact
UI layer:
- operator action guidance
- severity
- recoverability
- maybe translated/localized message
Logs and telemetry:
- full technical context
- correlation id
- adapter name
- step name
- external code
- stack trace if applicable
Good systems preserve all of these without confusing them.
7. Partial failure and degraded operation
Real systems rarely fail in perfectly binary ways.
Modeling partial success
Suppose inspection finishes, core measurements are valid, but 7 thumbnails fail to save due to disk pressure. You need a model that can say:
- workflow completed
- result set is valid
- some non-critical artifacts are missing
- warnings should be visible
- support should be able to trace exactly what degraded
A bool cannot express any of that.
A realistic aggregate result
public sealed record InspectionCompletionResult(
string RunId,
bool CoreResultsSaved,
int ImagesCaptured,
int MainImagesSaved,
int ThumbnailsSaved,
bool TelemetryUploaded,
IReadOnlyList<ErrorDetail> Warnings,
IReadOnlyList<ErrorDetail> Errors)
{
public bool IsSuccess => CoreResultsSaved && Errors.Count == 0;
public bool IsCompletedWithWarnings => CoreResultsSaved && Warnings.Count > 0;
public bool IsPartialSuccess => CoreResultsSaved && (Warnings.Count > 0 || Errors.Count > 0);
}
This kind of result tells the truth better.
Workflows that continue with warnings
A good orchestrator should know which failures are non-fatal.
foreach (var image in capturedImages)
{
var saveResult = await _imageSaver.SaveImageAsync(image, ct);
if (saveResult.IsFailure)
{
if (IsCritical(saveResult.Errors))
{
return AbortRun("Critical image persistence failure.", saveResult.Errors);
}
warnings.AddRange(saveResult.Errors);
continue;
}
warnings.AddRange(saveResult.Warnings);
}
This is explicit. It is readable. It matches operational reality.
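The loop above calls IsCritical without showing it. One possible shape is a lookup against a fixed set of stable error codes. This is a sketch under assumptions: CriticalityPolicy is a hypothetical helper, ErrorDetail is trimmed down to two fields to keep the example self-contained, and "Results.Core.PersistFailed" is an invented code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var thumbnailOnly = new[] { new ErrorDetail("ImageSave.Thumbnail.Failed", "Disk quota exceeded.") };
var mainLost = new[] { new ErrorDetail("ImageSave.Main.Failed", "Target volume is read-only.") };

Console.WriteLine(CriticalityPolicy.IsCritical(thumbnailOnly)); // False
Console.WriteLine(CriticalityPolicy.IsCritical(mainLost));      // True

// Trimmed-down stand-in for the fuller ErrorDetail record used elsewhere.
public sealed record ErrorDetail(string Code, string Message);

public static class CriticalityPolicy
{
    // Codes that invalidate the run. Everything else is treated as a warning.
    private static readonly HashSet<string> CriticalCodes = new(StringComparer.Ordinal)
    {
        "ImageSave.Main.Failed",
        "Results.Core.PersistFailed" // hypothetical code
    };

    public static bool IsCritical(IEnumerable<ErrorDetail> errors) =>
        errors.Any(e => CriticalCodes.Contains(e.Code));
}
```

Keeping the list in one place means the question "which failures stop a run?" has a single answer that can be reviewed, rather than being scattered across loops.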
Degraded modes
Sometimes the system should enter degraded mode intentionally.
Examples:
- telemetry stream unavailable → continue without live dashboard
- optional analytics engine offline → continue with core inspection only
- secondary image annotation service down → continue and mark post-processing incomplete
Model this as a first-class state, not an accidental afterthought.
public enum OperationMode
{
Full,
Degraded,
SafeStop
}
Then your workflow state can include mode and reasons.
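A workflow state that carries mode and reasons might look like the sketch below. WorkflowState, Degrade, and RequestSafeStop are hypothetical names, and the rule that SafeStop is never downgraded is an assumption of this example. OperationMode is repeated only so the block compiles on its own.

```csharp
using System;
using System.Collections.Generic;

var state = new WorkflowState();
state.Degrade("Telemetry stream unavailable.");
state.RequestSafeStop("PLC heartbeat lost.");
state.Degrade("Thumbnail stage failed."); // recorded, but the mode stays SafeStop

Console.WriteLine(state.Mode);                        // SafeStop
Console.WriteLine(string.Join(" | ", state.Reasons));

public enum OperationMode
{
    Full,
    Degraded,
    SafeStop
}

// Hypothetical state holder: the current mode plus every reason that led to it.
public sealed class WorkflowState
{
    private readonly List<string> _reasons = new();

    public OperationMode Mode { get; private set; } = OperationMode.Full;
    public IReadOnlyList<string> Reasons => _reasons;

    public void Degrade(string reason)
    {
        _reasons.Add(reason);
        // Only Full can become Degraded; SafeStop is never relaxed here.
        if (Mode == OperationMode.Full)
            Mode = OperationMode.Degraded;
    }

    public void RequestSafeStop(string reason)
    {
        _reasons.Add(reason);
        Mode = OperationMode.SafeStop;
    }
}
```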
Collecting multiple errors instead of failing immediately
Validation is the obvious example, but workflows also benefit from aggregation in the right places.
For example, during shutdown:
- motion stop failed on axis A
- telemetry flush failed
- one result file remained locked
You may want to collect all issues, not stop after the first, because shutdown diagnostics matter.
The key is to aggregate where it improves operator action or supportability, and fail fast where safety or correctness requires it.
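For the shutdown case, aggregation can be as simple as running every step and recording what failed. The following is a sketch, not a complete shutdown sequence: Shutdown, ShutdownIssue, and the three step names are hypothetical, and the failing steps are simulated.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

var steps = new (string Name, Func<Task> Action)[]
{
    ("StopMotion", () => Task.CompletedTask),
    ("FlushTelemetry", () => throw new TimeoutException("Telemetry flush timed out.")),
    ("ReleaseResultFiles", () => Task.FromException(new System.IO.IOException("File is locked.")))
};

var issues = await Shutdown.RunAllAsync(steps);
foreach (var issue in issues)
    Console.WriteLine($"{issue.Step}: {issue.Message}");
// FlushTelemetry: Telemetry flush timed out.
// ReleaseResultFiles: File is locked.

public sealed record ShutdownIssue(string Step, string Message);

public static class Shutdown
{
    // Runs every step even if earlier ones fail, and returns all collected issues.
    public static async Task<IReadOnlyList<ShutdownIssue>> RunAllAsync(
        IEnumerable<(string Name, Func<Task> Action)> steps)
    {
        var issues = new List<ShutdownIssue>();
        foreach (var (name, action) in steps)
        {
            try
            {
                await action();
            }
            catch (Exception ex)
            {
                // Catching broadly is acceptable here because the goal is diagnostics,
                // and there is nothing left to protect by stopping early.
                issues.Add(new ShutdownIssue(name, ex.Message));
            }
        }
        return issues;
    }
}
```

A step that guards safety, such as stopping motion, would normally be handled outside a best-effort loop like this one.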
8. Error propagation across layers
This is where mature design shows up.
A failure should not flow unchanged through every layer. It should be translated at boundaries so each layer sees what it needs.
A practical layered view
Infrastructure layer
Deals with:
- SDK exceptions
- IO exceptions
- database failures
- socket issues
- serialization failures
This layer often catches low-level exceptions only when it can add context or translate meaningfully. Otherwise it may let them bubble.
Machine adapter layer
Converts vendor-specific behavior into machine-relevant outcomes.
It knows that:
- vendor code 1042 means device busy
- timeout during command acknowledgement likely means lost communication
- certain faults are retryable
- certain faults should map to operator-facing machine states
Application/workflow layer
Decides:
- stop run or continue
- warning or hard failure
- retry or escalate
- update run state
- surface alarm
- record audit event
UI/ViewModel layer
Decides:
- what the operator sees
- whether to disable buttons
- whether to show modal error, banner, status line, or alarm panel
- whether technical details are hidden or available in diagnostics screen
Example flow
Vendor SDK throws timeout:
TimeoutException("Command ACK not received within 1500 ms")
Machine adapter translates:
new ErrorDetail(
    "Machine.Command.Timeout",
    "Command ACK not received within 1500 ms",
    ErrorCategory.Timeout,
    "Machine did not respond in time.")
Workflow layer evaluates:
- if this is a homing command, stop workflow
- if this is optional light-control refresh, retry once and continue if safe
UI layer shows:
- “Machine did not respond. Check machine connection and retry.”
Logging layer records:
- command name
- timeout duration
- machine state
- correlation id
- adapter operation
- original exception
That is good boundary translation.
Where to catch, where to rethrow, where to convert
Catch when:
- you can add important context
- you can translate to a meaningful domain/application result
- you can decide recovery or fallback
- you can preserve safety
Rethrow or allow bubbling when:
- the layer cannot handle it meaningfully
- it represents a programming/invariant failure
- translation would only hide important technical truth
Convert to result when:
- the caller is expected to branch on it
- the failure is part of normal operation
- you want explicit contract-driven handling
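The three choices can sit side by side in one adapter method. This sketch uses invented types throughout: VendorException, FakeSdk, LightAdapter, and CommandResult are hypothetical, and vendor code 1042 is borrowed from the "device busy" example above.

```csharp
#nullable enable
using System;

Console.WriteLine(new LightAdapter(new FakeSdk()).TrySetIntensity(50).Message);
Console.WriteLine(new LightAdapter(new FakeSdk(1042)).TrySetIntensity(50).Message);
try
{
    new LightAdapter(new FakeSdk(9999)).TrySetIntensity(50);
}
catch (InvalidOperationException ex)
{
    Console.WriteLine($"{ex.Message} (inner: {ex.InnerException?.GetType().Name})");
}

// Hypothetical vendor exception carrying a numeric code.
public sealed class VendorException : Exception
{
    public VendorException(int code, string message) : base(message) => Code = code;
    public int Code { get; }
}

// Stand-in for a real SDK; throws the configured fault on every call.
public sealed class FakeSdk
{
    private readonly int? _faultCode;
    public FakeSdk(int? faultCode = null) => _faultCode = faultCode;

    public void SetIntensity(int percent)
    {
        if (_faultCode is int code)
            throw new VendorException(code, $"Vendor fault {code}.");
    }
}

public sealed record CommandResult(bool IsSuccess, string Message);

public sealed class LightAdapter
{
    private readonly FakeSdk _sdk;
    public LightAdapter(FakeSdk sdk) => _sdk = sdk ?? throw new ArgumentNullException(nameof(sdk));

    public CommandResult TrySetIntensity(int percent)
    {
        try
        {
            _sdk.SetIntensity(percent);
            return new CommandResult(true, "Intensity set.");
        }
        // Convert: a known vendor code that the caller is expected to branch on.
        catch (VendorException ex) when (ex.Code == 1042)
        {
            return new CommandResult(false, "Light controller is busy. Retry shortly.");
        }
        // Catch and rethrow with context: an unknown vendor fault stays an exception.
        catch (VendorException ex)
        {
            throw new InvalidOperationException(
                $"SetIntensity({percent}) failed with vendor code {ex.Code}.", ex);
        }
        // Anything else, for example a NullReferenceException, is left to bubble.
    }
}
```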
9. Async, pipelines, and failure contracts
Async code makes failure easier to lose.
That is one of the biggest real-world dangers.
Failure modeling in async methods
Async methods already use exceptions naturally through Task. That is useful, but also dangerous because it tempts teams to use exceptions for everything.
A good rule:
- expected outcomes: model explicitly in the returned result
- unexpected faults: let exceptions fault the task
Example:
public async Task<Result<InspectionFrame>> TryAcquireFrameAsync(CancellationToken ct)
{
if (!_machineState.IsAcquisitionReady)
{
return Result<InspectionFrame>.Failure(
new ErrorDetail(
"Acquire.InvalidState",
"Machine is not ready for acquisition.",
ErrorCategory.Domain,
"Machine is not ready to capture images."));
}
var frame = await _camera.AcquireAsync(ct); // unexpected SDK failures can still throw
return Result<InspectionFrame>.Success(frame);
}
Failure propagation in Task-based flows
In orchestrations, it must be clear which failures:
- fault the whole task
- return as expected results
- are aggregated into warnings
- trigger cancellation of sibling operations
Without that clarity, async flows become impossible to reason about.
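A small sketch of "a critical fault cancels the sibling operations", using a shared CancellationTokenSource and Task.WhenAll. The stage names, the RunStageAsync helper, and the 50 ms delay are assumptions for illustration; real stages would do actual work instead of waiting.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

using var runCts = new CancellationTokenSource();

// Two sibling stages share one run-level cancellation source.
Task acquisition = RunStageAsync("acquisition", failAfterMs: null, runCts);
Task persistence = RunStageAsync("persistence", failAfterMs: 50, runCts);

try
{
    await Task.WhenAll(acquisition, persistence);
}
catch (Exception ex)
{
    // Awaiting WhenAll rethrows the fault, so it cannot go unobserved.
    Console.WriteLine($"Run stopped: {ex.Message}");
}

// Inspecting each task shows which stage faulted and which was only cancelled.
Console.WriteLine($"acquisition: {acquisition.Status}, persistence: {persistence.Status}");

static async Task RunStageAsync(string name, int? failAfterMs, CancellationTokenSource cts)
{
    try
    {
        if (failAfterMs is int delay)
        {
            await Task.Delay(delay, cts.Token);
            throw new InvalidOperationException($"Stage '{name}' failed critically.");
        }

        // Simulates a long-running stage that only ends when the run is cancelled.
        await Task.Delay(Timeout.Infinite, cts.Token);
    }
    catch (Exception ex) when (ex is not OperationCanceledException)
    {
        cts.Cancel(); // a critical fault cancels the sibling stages
        throw;
    }
}
```

The distinction between the two final task states is the useful part: the orchestrator can tell the stage that caused the stop from the stages that were stopped because of it.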
Channel/pipeline stage failure handling
In streaming pipelines, failures often happen inside background consumers:
- image save loop
- analytics stage
- telemetry stage
- result export stage
If a background loop throws and nobody observes it, the system may continue in a broken state silently.
That is extremely dangerous.
Example: hidden background save loop failure
Bad:
_ = Task.Run(async () =>
{
await foreach (var image in _channel.Reader.ReadAllAsync(ct))
{
await _imageSaver.SaveImageAsync(image, ct);
}
});
If that task faults, the workflow may never know.
Better:
private Task? _saveLoopTask;
public void StartSaveLoop(CancellationToken ct)
{
_saveLoopTask = RunSaveLoopAsync(ct);
}
private async Task RunSaveLoopAsync(CancellationToken ct)
{
await foreach (var image in _channel.Reader.ReadAllAsync(ct))
{
var result = await _imageSaver.SaveImageAsync(image, ct);
if (result.IsFailure)
{
if (IsCritical(result.Errors))
{
throw new SavePipelineCriticalException(result.Errors);
}
_warningSink.Report(result.Errors);
}
if (result.Warnings.Count > 0)
_warningSink.Report(result.Warnings);
}
}
Then the orchestrator explicitly observes the task:
try
{
await _saveLoopTask!;
}
catch (SavePipelineCriticalException ex)
{
_logger.LogError(ex, "Image save loop failed critically.");
await StopRunSafelyAsync();
throw;
}
Partial pipeline failure vs full workflow cancellation
This is a key design choice.
Examples:
- thumbnail stage fails → continue
- core result persistence fails → cancel run
- monitoring loop throws → maybe switch to degraded mode and raise alarm
- PLC heartbeat lost → stop run immediately
The orchestrator should own this policy. Not every stage should decide alone.
Why hidden async failures are dangerous
Because the UI may still show “running,” but part of the system is dead.
That is worse than a visible crash. It is silent corruption of operational truth.
10. How we use this in .NET in practice
Here is the practical model I would recommend for many production .NET desktop systems.
Use exceptions for truly exceptional or unexpected failures
Examples:
- code bugs
- invariant violations
- unexpected third-party crashes
- impossible state transitions
- misuse of internal API contracts
Use Result-like types for expected outcomes
Examples:
- validation
- command rejection
- unavailable-but-handled machine state
- partial success
- warnings
- skip/continue decisions
Map low-level faults into meaningful application errors
At boundaries, convert technical exceptions into application-relevant or domain-relevant outcomes where appropriate.
Carry error codes and safe messages
Have stable codes. Codes matter for support, automation, and observability.
Examples:
- Recipe.Invalid
- Machine.InvalidState
- Machine.Command.Timeout
- Camera.Unavailable
- ImageSave.Thumbnail.Failed
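Codes stay stable more easily when they live in one place instead of being retyped as string literals. A minimal sketch, assuming a hypothetical ErrorCodes class grouped by area:

```csharp
using System;

Console.WriteLine(ErrorCodes.Machine.CommandTimeout); // Machine.Command.Timeout

// Hypothetical central catalog of stable error codes.
public static class ErrorCodes
{
    public static class Recipe
    {
        public const string Invalid = "Recipe.Invalid";
    }

    public static class Machine
    {
        public const string InvalidState = "Machine.InvalidState";
        public const string CommandTimeout = "Machine.Command.Timeout";
    }

    public static class Camera
    {
        public const string Unavailable = "Camera.Unavailable";
    }

    public static class ImageSave
    {
        public const string MainFailed = "ImageSave.Main.Failed";
        public const string ThumbnailFailed = "ImageSave.Thumbnail.Failed";
    }
}
```

Once a code has shipped, treat it as part of the contract: dashboards, support scripts, and alarm rules may already depend on the exact string.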
Design APIs with explicit failure contracts
Some practical examples:
public interface IRecipeValidator
{
ValidationResult ValidateRecipe(InspectionRecipe recipe);
}
public interface IMachineCommandService
{
Task<Result<StartInspectionOutcome>> StartInspectionAsync(
InspectionRecipe recipe,
CancellationToken ct);
Task<Result> StopInspectionAsync(CancellationToken ct);
}
public interface IImagePersistenceService
{
Task<Result<SaveImageOutcome>> SaveImageAsync(
CapturedImage image,
CancellationToken ct);
}
public interface IWorkflowStep
{
Task<WorkflowStepResult> ExecuteAsync(WorkflowContext context, CancellationToken ct);
}
This is much clearer than a mix of bool, void, random exceptions, and event-based side channels.
A more complete example
public sealed class InspectionWorkflow
{
    private readonly IMachineCommandService _machine;
    private readonly IImagePersistenceService _imagePersistence;
    private readonly ILogger<InspectionWorkflow> _logger;
    public InspectionWorkflow(
        IMachineCommandService machine,
        IImagePersistenceService imagePersistence,
        ILogger<InspectionWorkflow> logger)
    {
        _machine = machine;
        _imagePersistence = imagePersistence;
        _logger = logger;
    }
    public async Task<Result<InspectionCompletionResult>> RunAsync(
        InspectionRecipe recipe,
        IReadOnlyList<CapturedImage> images,
        CancellationToken ct)
    {
        var warnings = new List<ErrorDetail>();
        var errors = new List<ErrorDetail>();
        var mainImagesSaved = 0;
        var thumbnailsSaved = 0;
        var startResult = await _machine.StartInspectionAsync(recipe, ct);
        if (startResult.IsFailure)
        {
            return Result<InspectionCompletionResult>.Failure(startResult.Errors.ToArray());
        }
        warnings.AddRange(startResult.Warnings);
        foreach (var image in images)
        {
            var saveResult = await _imagePersistence.SaveImageAsync(image, ct);
            if (saveResult.IsFailure)
            {
                // Losing a main image invalidates the run, so stop saving.
                if (saveResult.Errors.Any(e => e.Code == "ImageSave.Main.Failed"))
                {
                    errors.AddRange(saveResult.Errors);
                    break;
                }
                warnings.AddRange(saveResult.Errors);
                continue;
            }
            // Count what was actually persisted instead of deriving it from error counts.
            mainImagesSaved++;
            if (saveResult.Value!.ThumbnailSaved) thumbnailsSaved++;
            warnings.AddRange(saveResult.Warnings);
        }
        if (errors.Count > 0)
            return Result<InspectionCompletionResult>.Failure(errors.ToArray());
        var completion = new InspectionCompletionResult(
            RunId: startResult.Value!.RunId!,
            CoreResultsSaved: true,
            ImagesCaptured: images.Count,
            MainImagesSaved: mainImagesSaved,
            ThumbnailsSaved: thumbnailsSaved,
            TelemetryUploaded: true, // placeholder: this sketch has no telemetry step
            Warnings: warnings,
            Errors: errors);
        return Result<InspectionCompletionResult>.Success(completion, warnings.ToArray());
    }
}
That is not toy-level. It reflects real workflow thinking.
11. Common mistakes
These mistakes are very common because teams often evolve failure handling reactively.
Throwing exceptions for normal validation failures
Why it happens:
- easy at first
- framework culture sometimes encourages exception-first style
- teams do not distinguish expected vs unexpected failure
What it causes:
- noisy logs
- harder control flow
- awkward UI handling
- validation treated like a crash path
Validation is usually not exceptional. It is an expected branch.
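A minimal sketch of validation as an ordinary return value might look like the following. The shapes of `InspectionRecipe`, `ValidationIssue`, and `ValidationResult`, the threshold range of 0 to 1, and the two issue codes are all assumptions made for this example, not the definitive types of any particular system.

```csharp
using System.Collections.Generic;

// Hypothetical shapes for illustration only.
public sealed record InspectionRecipe(string Name, double Threshold);
public sealed record ValidationIssue(string Code, string Message);

public sealed class ValidationResult
{
    public IReadOnlyList<ValidationIssue> Issues { get; }
    public bool IsValid => Issues.Count == 0;

    public ValidationResult(IReadOnlyList<ValidationIssue> issues) => Issues = issues;
}

public static class RecipeValidator
{
    // Collects every problem instead of throwing on the first one,
    // so the UI can show the operator all issues at once.
    public static ValidationResult Validate(InspectionRecipe recipe)
    {
        var issues = new List<ValidationIssue>();

        if (string.IsNullOrWhiteSpace(recipe.Name))
        {
            issues.Add(new ValidationIssue("Recipe.Name.Missing", "Enter a recipe name."));
        }

        // Written as a negated range check so that NaN is rejected too.
        if (!(recipe.Threshold >= 0.0 && recipe.Threshold <= 1.0))
        {
            issues.Add(new ValidationIssue(
                "Recipe.Threshold.OutOfRange",
                "The threshold must be between 0 and 1."));
        }

        return new ValidationResult(issues);
    }
}
```

Invalid input here is just data flowing back to the caller. Nothing is logged as an error and no stack trace is produced for what is a normal operator mistake.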
Swallowing exceptions and returning generic “failed”
Why it happens:
- fear of crashes
- rushed defensive coding
- desire to “keep system running”
Example:
catch (Exception)
{
    return false;
}
What it causes:
- lost diagnostic detail
- impossible support investigation
- hidden severity
- meaningless UI messaging
This is one of the worst patterns in production code.
Returning bool with no reason
Why it happens:
- simplicity
- legacy habits
- trying to avoid complexity
What it causes:
- opaque contracts
- caller confusion
- side-channel dependency
- inconsistent user messaging
- poor observability
bool is often too weak for important operations.
Leaking low-level technical errors directly to UI
Why it happens:
- shortcut from catch block to message box
- no translation layer
- internal exception text used as user communication
What it causes:
- operator confusion
- frightening or meaningless messages
- poor UX
- accidental exposure of irrelevant technical detail
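A translation layer does not have to be elaborate. The sketch below maps stable error codes to operator-safe wording and falls back to a generic message that still carries the code for support. The class name `OperatorMessages` and the message texts are assumptions for illustration.

```csharp
using System.Collections.Generic;

public static class OperatorMessages
{
    // Operator-facing wording keyed by stable error code.
    // Raw exception text never appears in this table.
    private static readonly Dictionary<string, string> ByCode = new Dictionary<string, string>
    {
        ["Machine.InvalidState"] = "The machine is not ready. Home the machine and try again.",
        ["Camera.Unavailable"] = "The camera is not responding. Check the camera connection.",
        ["ImageSave.Thumbnail.Failed"] = "Inspection completed with warnings. Preview images missing.",
    };

    // Unknown codes get a generic message that still includes the code,
    // so support can look it up in logs.
    public static string For(string code) =>
        ByCode.TryGetValue(code, out var message)
            ? message
            : $"The operation failed (code {code}). Please contact support.";
}
```

The message box then calls `OperatorMessages.For(error.Code)` instead of displaying `exception.Message`.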
Mixing domain failures and technical failures together
Why it happens:
- no error taxonomy
- ad hoc custom exceptions
- lack of architecture ownership
What it causes:
- retry logic becomes unreliable
- workflow stop/continue decisions become inconsistent
- hard-to-read code
Inconsistent result styles across the codebase
Examples:
- some methods throw
- some return bool
- some return null
- some return tuples
- some use custom Result
- some signal failures by events
This is chaos.
A large system needs conventions.
Hiding async/background failures
Why it happens:
- fire-and-forget tasks
- unobserved pipeline consumer faults
- background services without supervision
What it causes:
- silent data loss
- stale UI state
- partial dead system behavior
- very long debugging cycles
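One hedged way to avoid silent background faults is to route every background task through a small supervisor that always observes the outcome. `SupervisedTask` and its `onFault` callback are hypothetical names; what the callback does (raise an alarm, mark the UI as degraded, restart the consumer) depends on the system.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class SupervisedTask
{
    // Starts background work and guarantees that a fault reaches onFault
    // instead of disappearing inside an unobserved task.
    public static Task Run(
        Func<CancellationToken, Task> work,
        Action<Exception> onFault,
        CancellationToken ct)
    {
        return Task.Run(async () =>
        {
            try
            {
                await work(ct).ConfigureAwait(false);
            }
            catch (OperationCanceledException) when (ct.IsCancellationRequested)
            {
                // Requested shutdown: an expected outcome, not a fault.
            }
            catch (Exception ex)
            {
                onFault(ex);
            }
        });
    }
}
```

Compared with a bare fire-and-forget `Task.Run`, the difference is that a dead consumer becomes a visible event the rest of the system can react to.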
No structured error codes or categories
Why it happens:
- teams rely on free-form strings
- support needs were not considered up front
What it causes:
- impossible reporting aggregation
- weak support playbooks
- no stable contract for telemetry or alarm routing
12. Trade-offs
There is no free design.
Simplicity vs explicitness
A bool return is simple. A rich result is explicit.
The right choice depends on the importance and variability of failure.
For critical machine/workflow operations, explicitness usually wins.
Exception-based flow vs Result-based flow
Exception flow is concise for rare abnormal cases. Result flow is clearer for expected branching.
Use each where it fits. Overusing either creates pain.
Rich error models vs complexity
A very rich model can become heavy:
- too many types
- too much wrapping
- too much mapping code
A very weak model becomes ambiguous.
Experienced engineers aim for enough structure to preserve meaning, but not so much that every method becomes ceremony.
Preserving detail vs keeping APIs readable
Not every result needs twenty fields.
Keep the surface contract readable:
- code
- category
- operator-safe message
- maybe metadata
- warnings/errors collection where needed
Deeper technical detail can stay in logs or diagnostic context.
Consistency across system vs local optimization
One team may want a custom result per feature. Another wants one universal result type. Both extremes can be awkward.
Usually a good compromise is:
- a shared base error/result model
- specialized result payloads where the domain needs them
- documented conventions for when to throw vs return result
That gives consistency without flattening everything.
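To make that compromise concrete, here is a minimal sketch of a shared generic result with a feature-specific payload. It is an assumption about one possible shape, not the definitive `Result<T>` for any codebase; `ErrorDetail` and `SaveImageOutcome` are reduced to the few members this example needs, and unlike a nullable `Value` property this version throws if the value of a failed result is read.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Shared base model: every feature reports errors and warnings the same way.
public sealed record ErrorDetail(string Code, string OperatorMessage);

public sealed class Result<T>
{
    private readonly T? _value;

    public IReadOnlyList<ErrorDetail> Errors { get; }
    public IReadOnlyList<ErrorDetail> Warnings { get; }
    public bool IsSuccess => Errors.Count == 0;
    public bool IsFailure => !IsSuccess;

    // Reading the value of a failed result is a caller bug, so it throws.
    public T Value => IsSuccess
        ? _value!
        : throw new InvalidOperationException("A failed result has no value.");

    private Result(T? value, IReadOnlyList<ErrorDetail> errors, IReadOnlyList<ErrorDetail> warnings)
    {
        _value = value;
        Errors = errors;
        Warnings = warnings;
    }

    // Success may still carry warnings (partial success).
    public static Result<T> Success(T value, params ErrorDetail[] warnings) =>
        new(value, Array.Empty<ErrorDetail>(), warnings);

    public static Result<T> Failure(params ErrorDetail[] errors)
    {
        if (errors.Length == 0)
        {
            throw new ArgumentException("A failure needs at least one error.", nameof(errors));
        }
        return new(default, errors, Array.Empty<ErrorDetail>());
    }
}

// Specialized payload: only the image-saving feature knows about thumbnails.
public sealed record SaveImageOutcome(string MainImagePath, bool ThumbnailSaved);
```

The shared part stays small, and each feature adds meaning through its own payload type rather than through a new result hierarchy.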
13. Designing good failure contracts
A good failure contract tells the truth about the operation.
What makes a good failure contract
It should tell the caller:
- what successful outcome looks like
- what expected failures look like
- whether partial success exists
- whether warnings can be returned
- whether exceptions still represent unexpected faults
- what the caller is expected to do
How callers know what to expect
The contract should be visible in the signature and naming.
Bad:
Task<bool> ExecuteAsync();
Better:
Task<WorkflowStepResult> ExecuteAsync(WorkflowContext context, CancellationToken ct);
The return type now tells the caller that a step can skip, warn, or reject, not only succeed or fail.
When API should force explicit handling
If the failure is business-significant, the API should make it hard to ignore.
Validation is a good example.
ValidationResult Validate(InspectionRecipe recipe);
This forces the caller to inspect validity and issues.
A machine start command is another example.
Task<Result<StartInspectionOutcome>> StartInspectionAsync(
    InspectionRecipe recipe,
    CancellationToken ct);
The caller can no longer pretend that failure is just “maybe false.”
Examples
Machine service API
public interface IMachineService
{
    Task<Result<MachineStatusSnapshot>> GetStatusAsync(CancellationToken ct);
    Task<Result<StartInspectionOutcome>> StartInspectionAsync(InspectionRecipe recipe, CancellationToken ct);
    Task<Result> StopAsync(CancellationToken ct);
}
Expected command failures are explicit.
Workflow step API
public interface IWorkflowStep
{
    Task<WorkflowStepResult> ExecuteAsync(WorkflowContext context, CancellationToken ct);
}
Better than exceptions for every skip, warning, or rejection.
Validation API
public interface IRecipeValidator
{
    ValidationResult Validate(InspectionRecipe recipe);
}
Do not make validation throw for normal invalid input.
Save pipeline API
public interface IImageSaver
{
    Task<Result<SaveImageOutcome>> SaveAsync(CapturedImage image, CancellationToken ct);
}
This supports partial success and warnings naturally.
14. Debugging and observability of result/field failures
One of the operational benefits of structured failure modeling is faster diagnosis.
How structured failure models help production debugging
If errors are modeled with codes and categories, support can quickly answer:
- what happened
- where it happened
- how often it happens
- which failures are operator errors vs system faults
- which are retryable vs fatal
That is much better than searching logs for text fragments.
How error codes and categories improve supportability
Examples:
- Recipe.Invalid
- Machine.InvalidState
- Machine.Command.Timeout
- ImageSave.Thumbnail.Failed
- Camera.Unavailable
These codes can drive:
- dashboards
- alert thresholds
- support runbooks
- alarm classifications
- trend analysis
Correlating operator-visible failures with logs and telemetry
A strong pattern is to include:
- operation/run id
- error code
- step name
- machine id
- recipe id
- timestamp
- correlation id
The operator may see:
Inspection completed with warnings. Preview images missing.
The log/telemetry can show:
- RunId: R-20260417-1422
- WarningCode: ImageSave.Thumbnail.Failed
- Count: 12
- Node: IPC-03
- DiskFreeMB: 142
- Step: ThumbnailGenerator
That correlation sharply reduces MTTR because support can go from symptom to cause much faster.
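A small sketch of how one failure event could feed both views: the same run id and code appear in the structured log line and in the short reference shown next to the operator message. The record `FailureEvent` and its two methods are hypothetical, and a production system would more likely push these fields through its logging framework's structured properties than format strings by hand.

```csharp
using System;

public sealed record FailureEvent(
    string RunId,
    string Code,
    string Step,
    string Node,
    int Count,
    DateTimeOffset Timestamp)
{
    // Key=value pairs are easy to search, count, and trend by code or run id.
    public string ToLogLine() =>
        $"ts={Timestamp:O} runId={RunId} code={Code} step={Step} node={Node} count={Count}";

    // Short reference shown alongside the operator message, so a screenshot
    // or phone call is enough for support to find the matching log entries.
    public string ToOperatorReference() => $"{Code} (run {RunId})";
}
```

With this in place, the operator sees the friendly warning plus a reference, and support searches logs for the same `runId` and `code` values.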
How experienced engineers use failure modeling to reduce MTTR
They design errors not just for code correctness, but for operations.
They ask:
- can support distinguish operator misuse from machine fault?
- can we count and trend this failure?
- can we tell whether degraded mode was entered?
- can we correlate the UI message to a stable code?
- can we tell whether retries happened and why?
That is mature engineering.
15. Senior engineer mental model
This is the main shift.
A senior engineer stops thinking of failure as “the thing that happens in catch.” They think of failure as part of the domain model.
In a real system:
- some negative outcomes are normal
- some are warnings
- some are partial success
- some require recovery
- some require operator action
- some should stop immediately
- some indicate a system defect
Those differences should appear in the design.
How experienced engineers think about expected outcomes vs true exceptions
They ask:
- Is this an expected possibility in normal operation?
- Does the caller need to branch on it?
- Is it safe or unsafe?
- Should it be visible in the signature?
- Is it a business/domain outcome or a technical fault?
- Does partial success matter here?
- What should the operator see?
- What should logs and telemetry retain?
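If the outcome is expected in normal operation and the caller needs to branch on it, it usually belongs in a result model. If it is unexpected or signals a broken assumption, it belongs in exception flow.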
How they keep error handling consistent across a large codebase
They establish conventions such as:
- exceptions for unexpected/programming/invariant failures
- result types for expected domain/application outcomes
- validation returns structured validation result
- adapter boundaries translate external faults into application-level errors
- operator-facing messages never come directly from raw exceptions
- background task failures must be observed and surfaced
- error codes are stable and structured
This consistency matters more than theoretical purity.
How they design APIs that are honest about failure
Honest APIs tell callers what can happen.
Dishonest APIs hide important outcomes behind:
- bool
- null
- generic exception
- side effects
- logs only
Good APIs make failure behavior discoverable and predictable.
How they keep failure understandable for both developers and operators
They separate views:
- technical detail for developers and logs
- meaningful categories for workflows
- safe, actionable wording for operators
That separation is one of the marks of production-grade design.
A practical recommendation for interview-level thinking
If I had to summarize the whole topic into one practical rule set for a senior/principal interview, I would say this:
Use exceptions for things that are truly abnormal, unexpected, or represent bugs or broken assumptions.
Use Result-style models for things that are expected parts of business, workflow, machine state, validation, partial success, and warnings.
Translate low-level technical faults into meaningful application/domain failures at boundaries.
Design failure contracts explicitly, especially in long-running workflows, machine commands, and background pipelines.
Preserve enough detail for logs, telemetry, support, and diagnosis, but keep operator messages clear, safe, and actionable.
And above all: do not treat all failures as the same kind of thing. In real systems, failure has shape. Good engineers model that shape clearly.