Advanced error and result modeling in modern .NET systems

When systems become real, failure stops being a side concern.

In simple demo code, failure often looks like one thing: “something went wrong.” In production systems, especially WPF desktop systems connected to machines, cameras, PLCs, storage, and long-running workflows, that is not good enough. Some failures are expected. Some are exceptional. Some should stop everything immediately. Some should only produce a warning. Some should be shown to the operator in simple language. Some should only be logged for engineers. Some mean “retry later.” Some mean “your workflow logic is wrong.” These are very different things, and a good system should model them differently.

That is why advanced error handling is really about failure modeling, not only try/catch.

1. Big picture — why failure modeling matters as much as failure handling

A lot of engineers think about failure only at the point where code breaks. They think in terms of catch blocks, logs, message boxes, and retries. But the deeper issue arises earlier than that: how do we represent failure in the first place?

That design choice shapes the whole system.

If every kind of failure becomes an exception, then callers cannot easily tell the difference between:

“recipe validation failed because the user entered an invalid threshold”
“machine rejected the command because it is not homed yet”
“camera disconnected unexpectedly”
“image save partially failed but inspection still completed”
“developer bug: impossible state reached”

These are not the same. They should not look the same in code. They should not be handled the same way in the UI. They should not be logged with the same severity. They should not trigger the same operational response.

In production systems, callers need a clear contract:

What can fail?
Is that failure expected or unexpected?
Is the caller supposed to handle it explicitly?
Can the workflow continue?
Should the operator intervene?
Should the system retry automatically?
Is the failure safe, unsafe, temporary, or fatal?

That is why failure modeling matters as much as failure handling. If the contract is vague, the code becomes vague. And vague code around failure becomes operational pain.

Real examples

A machine command execution service might expose this:

csharp

Task<bool> StartInspectionAsync(CancellationToken ct);

This tells the caller almost nothing.

Did false mean:

machine is disconnected?
machine is busy?
recipe is invalid?
safety interlock is active?
timeout?
SDK threw?
operation cancelled?
command rejected because state is wrong?

The caller now has to guess, or depend on side channels like logs, out parameters, global state, or message events. That is weak design.

A better API tells the truth:

csharp

Task<Result<StartInspectionOutcome, MachineCommandError>> StartInspectionAsync(
    InspectionRecipe recipe,
    CancellationToken ct);

Now the contract says: this operation has a success outcome, and expected command-related failures are modeled explicitly. That makes the system more honest.
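
The two-parameter Result<TValue, TError> in that signature is not a built-in .NET type. A minimal sketch of what such a type could look like follows; the member names (Ok, Fail, Match) are assumptions chosen for illustration, not an established API.

```csharp
using System;

// Hypothetical two-parameter result type; .NET does not ship one.
public sealed class Result<TValue, TError>
{
    private readonly TValue? _value;
    private readonly TError? _error;

    private Result(bool isSuccess, TValue? value, TError? error)
    {
        IsSuccess = isSuccess;
        _value = value;
        _error = error;
    }

    public bool IsSuccess { get; }
    public bool IsFailure => !IsSuccess;

    // Reading the wrong side is a programming error, so it throws.
    public TValue Value => IsSuccess
        ? _value!
        : throw new InvalidOperationException("No value: the result is a failure.");

    public TError Error => IsFailure
        ? _error!
        : throw new InvalidOperationException("No error: the result is a success.");

    public static Result<TValue, TError> Ok(TValue value) => new(true, value, default);
    public static Result<TValue, TError> Fail(TError error) => new(false, default, error);

    // Makes the caller supply a handler for both branches.
    public TOut Match<TOut>(Func<TValue, TOut> onSuccess, Func<TError, TOut> onFailure) =>
        IsSuccess ? onSuccess(_value!) : onFailure(_error!);
}
```

With a type like this, the caller cannot read the success value without first dealing with the failure branch, which is the point of putting the error type into the signature.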

2. Different kinds of failure

One of the biggest design improvements a senior engineer brings is the ability to distinguish failure types instead of flattening them.

Unexpected technical failures

These are failures the normal caller is not expected to handle as part of regular business flow.

Examples:

null reference because of a bug
corrupted internal state
invalid cast
race condition causing impossible state
driver crash
unmanaged SDK access violation wrapped by adapter boundary
disk subsystem throwing unexpected IO exception outside known behavior

These are usually exception territory.

They represent either:

a programming error
an infrastructure fault outside the normal business contract
a system integrity issue

Expected business or domain failures

These are failures that are normal outcomes in the domain.

Examples:

recipe is invalid
machine is in wrong state for command
command rejected because door is open
inspection cannot start because wafer is not loaded
workflow step cannot proceed because preconditions are not satisfied

These are often better modeled as results, not exceptions.

They are not “surprising system breakage.” They are expected possibilities.

Validation errors

Validation errors deserve their own category because they are usually not runtime faults at all. They are input-quality problems.

Examples:

threshold is out of allowed range
required calibration file missing from recipe
scan area overlaps invalid region
operator entered negative exposure time
recipe references camera profile that does not exist

Validation is often multi-error by nature. A good validation model can return all issues together rather than failing on the first one.

Recoverable vs unrecoverable failures

This distinction matters operationally.

Recoverable:

transient network issue
temporary file lock
telemetry stream dropped but can reconnect
command timeout where retry is safe
image thumbnail save failed, core image save succeeded

Unrecoverable:

safety state violation
machine axis control unavailable
corrupted results package
invariant broken inside workflow engine
persistent camera initialization failure required for run correctness

The design question is not only “did it fail?” but also “what is the safe next action?”
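
One way to make "the safe next action" explicit is to compute it from a few properties of the failure. The sketch below is only an illustration: FailureInfo, NextAction, and the ordering of the rules are assumptions, and a real machine would need rules agreed with its safety requirements.

```csharp
public enum NextAction { ContinueWithWarning, RetryAutomatically, StopRun }

// Hypothetical descriptor; the property names are illustrative only.
public sealed record FailureInfo(
    bool IsSafetyRelated,
    bool IsTransient,
    bool RetryIsSafe,
    bool AffectsCoreResult);

public static class NextActionPolicy
{
    public static NextAction Decide(FailureInfo failure)
    {
        // Safety always wins, even if the fault looks transient.
        if (failure.IsSafetyRelated) return NextAction.StopRun;

        // Retry only when the fault is temporary and repeating the command is harmless.
        if (failure.IsTransient && failure.RetryIsSafe) return NextAction.RetryAutomatically;

        // Secondary artifacts (thumbnails, telemetry) degrade the run but do not end it.
        if (!failure.AffectsCoreResult) return NextAction.ContinueWithWarning;

        return NextAction.StopRun;
    }
}
```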

Warnings vs hard failures

Warnings mean the operation is still considered complete enough to proceed, but something important should be recorded or shown.

Examples:

inspection completed but some thumbnails were not generated
data archived locally but cloud upload deferred
optional telemetry unavailable
recipe used fallback calibration

Warnings are very important in industrial systems because many workflows should continue in degraded mode instead of failing completely.

Partial success

This is one of the most under-modeled areas.

Examples:

98 images saved, 2 thumbnails failed
inspection completed, but one auxiliary statistics export failed
report generated, but one optional annotation layer missing
wafer scanned, but one region had low-confidence classification and was flagged

If your system models everything as either success or failure, you lose important reality. Production systems often produce mixed outcomes.

Why treating all of these the same leads to poor design

If all of them become exceptions:

business logic gets buried in catch blocks
logs become noisy
expected operator actions look like crashes
retries become inconsistent
UI messages become technical
workflows become fragile

If all of them become bool:

callers lose meaning
support cannot diagnose quickly
monitoring becomes weak
partial success disappears
warnings get lost
unsafe conditions may be ignored

The whole point of good failure modeling is to preserve meaning.

3. Real problems in a WPF desktop app controlling a wafer inspection machine

This kind of system is exactly where weak failure contracts cause chaos.

Imagine a WPF application controlling a wafer inspection machine. It coordinates:

operator UI
recipe loading and validation
machine state
motion control commands
camera/image acquisition
processing pipeline
result storage
alarms/logging
long-running workflows

Now look at what can go wrong.

Machine command may fail for very different reasons

A StartInspectionAsync operation may fail because:

machine is disconnected
machine is not initialized
machine is already running
machine is in alarm state
safety door is open
vendor SDK timed out
SDK threw native exception
cancellation requested
command acknowledged but completion event never arrived

These have different meanings.

The UI should not show the same message for all of them.

The workflow should not respond the same way either. “Machine not homed” is an operator-correctable domain issue. “Access violation in vendor camera SDK” is an engineering-level technical fault.

Workflow may partially complete

A run may successfully inspect wafers but fail to:

save some preview images
upload telemetry
archive some raw debug traces
enrich secondary analytics
write optional audit attachment

That should not necessarily invalidate the whole run.

If your workflow engine only supports “success” or “throw,” it becomes too brittle.

UI needs operator-friendly messages while logs retain technical details

The operator needs something like:

Camera not available. Check connection and power, then retry.

The log needs something like:

VendorCameraException HRESULT=0x8007001F during InitializeCamera on CameraAdapter.InitializeAsync. Serial=CAM-07. NativeCode=DeviceBusy. DriverVersion=5.2.13.

Those are different views of the same incident. Good error modeling supports both.
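
As a rough sketch of "one incident, two views", a single record can carry both the operator wording and the technical context, and project each audience's view on demand. The Incident type and its members are hypothetical, and a real application would more likely pass the context to a structured logger than format a string by hand.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical incident record holding both audiences' information.
public sealed record Incident(
    string Code,
    string OperatorMessage,
    Exception? Cause,
    IReadOnlyDictionary<string, object?> Context)
{
    // Operator view: safe wording only, no technical details.
    public string ToOperatorText() => OperatorMessage;

    // Engineer view: stable code, exception type and message, and all context pairs.
    public string ToLogText()
    {
        var context = string.Join(" ", Context
            .OrderBy(pair => pair.Key, StringComparer.Ordinal)
            .Select(pair => $"{pair.Key}={pair.Value}"));
        var cause = Cause is null ? "none" : $"{Cause.GetType().Name}: {Cause.Message}";
        return $"{Code} cause=[{cause}] {context}".TrimEnd();
    }
}
```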

Data pipeline may continue with degraded behavior

Suppose thumbnail generation fails because the GPU helper process crashes. The main inspection results are still valid and the machine can keep running. The workflow should continue, record a warning, and surface that the result package is degraded.

This is much better than either:

crashing the whole run, or
silently hiding the issue

Some failures should stop the run immediately

Examples:

stage lost synchronization
axis controller reported unsafe motion state
core result data cannot be persisted
recipe integrity invalidates measurement correctness
emergency stop triggered

These are fail-fast conditions.

Others should only mark warnings

Examples:

preview generation failed
optional telemetry unavailable
thumbnail save retried and still failed
background diagnostics export failed

These should not automatically stop production.

This is why failure contracts must be explicit. In this domain, failure is part of the workflow model.

4. Exceptions vs result-style modeling

This is where teams often become dogmatic.

Some teams say: “exceptions are bad; use Result everywhere.” Others say: “C# already has exceptions; use them for everything.”

Both extremes are usually wrong.

When exceptions are the right tool

Exceptions are the right tool when something happened that the caller is not expected to model as a routine branch.

Good examples:

programming bugs
invariant violations
impossible states
unexpected SDK crashes
serialization bug
null where contract guaranteed non-null
infrastructure fault outside the normal operation contract

Example:

csharp

public async Task<InspectionPlan> BuildPlanAsync(Recipe recipe, CancellationToken ct)
{
    if (recipe is null) throw new ArgumentNullException(nameof(recipe));

    var machineConfig = await _configProvider.GetCurrentAsync(ct);
    if (machineConfig.AxisCount <= 0)
        throw new InvalidOperationException("Machine configuration is invalid.");

    // ...
}

This is fine. These are not “expected operator outcomes.”

When a Result-style return model is better

A result model is better when the caller should explicitly handle a known, normal possibility.

Examples:

validation failure
command rejection because state is wrong
business rule not satisfied
workflow step skipped
partial completion with warnings
optional operation failed but system can continue

Example:

csharp

public Task<Result<Unit, RecipeValidationError[]>> ValidateAsync(
    InspectionRecipe recipe,
    CancellationToken ct);

That tells the caller: validation issues are expected, and you should handle them explicitly.

Why expected failures are often better as return values

Because they are part of the contract.

If the machine can validly reject a command because it is not in the correct state, that is not exceptional. It is a normal branch. Modeling it as an exception often pushes domain logic into technical control flow.

Example:

csharp

var result = await _machineService.StartInspectionAsync(recipe, ct);

if (result.IsFailure)
{
    // ErrorCode is the MachineCommandErrorCode enum; Code is the stable string code.
    switch (result.Error.ErrorCode)
    {
        case MachineCommandErrorCode.InvalidState:
            ShowOperatorMessage(result.Error.OperatorMessage);
            return;
        case MachineCommandErrorCode.MachineDisconnected:
            ShowReconnectPrompt();
            return;
        default:
            // escalate or fallback
            break;
    }
}

This is clearer than catching five different custom exceptions for normal machine rejection cases.

Why unexpected failures are often better as exceptions

Because they should travel fast, preserve stack information, and signal abnormal conditions clearly.

If the internal workflow engine reaches an impossible state, returning Result.Fail("unexpected") often hides the severity.

Example:

csharp

if (!_stateMachine.CanTransition(currentState, trigger))
    throw new InvalidOperationException(
        $"Invalid workflow transition from {currentState} using {trigger}.");

That is a system defect, not an expected domain outcome.

Trade-offs

Exceptions:

good for abnormal faults
preserve stack traces
integrate naturally with async/await
bad when overused for routine outcomes
can make expected failures invisible in signatures

Results:

make expected failure explicit
improve contract clarity
good for validation, domain rules, partial success
can become verbose
can lead to “plumbing fatigue” if overused everywhere

Experienced engineers do not choose one universal rule. They choose based on whether the failure is part of normal operation.
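
The "plumbing fatigue" cost can be reduced with a small chaining helper. The sketch below uses a deliberately simplified Result<T> with a string error so that it stands alone; the Then extension method is an assumption for illustration, not part of .NET.

```csharp
using System;

// Simplified stand-in for a result type; errors are plain strings here.
public sealed class Result<T>
{
    private Result(bool isSuccess, T? value, string? error)
    {
        IsSuccess = isSuccess;
        Value = value;
        Error = error;
    }

    public bool IsSuccess { get; }
    public T? Value { get; }
    public string? Error { get; }

    public static Result<T> Success(T value) => new(true, value, null);
    public static Result<T> Failure(string error) => new(false, default, error);
}

public static class ResultExtensions
{
    // Runs the next step only if the previous one succeeded; otherwise
    // carries the first error forward unchanged.
    public static Result<TOut> Then<TIn, TOut>(
        this Result<TIn> result,
        Func<TIn, Result<TOut>> next) =>
        result.IsSuccess
            ? next(result.Value!)
            : Result<TOut>.Failure(result.Error!);
}
```

A chain such as LoadRecipe(path).Then(Validate).Then(BuildPlan), where those three methods are hypothetical, then stops at the first failure without an if-block after every call.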

5. Result pattern in practice

A result pattern is simply a structured way to return outcome information without using exceptions for normal branches.

That sounds simple, but the important part is what you put into the result model.

A weak result model:

csharp

public sealed class Result
{
    public bool Success { get; init; }
    public string? Error { get; init; }
}

This is not enough for production systems.

A stronger model usually needs:

success/failure state
error code/category
operator-friendly message
technical details or metadata
warning collection
partial success support
maybe a typed success payload

Here is a realistic base model.

csharp

public enum ErrorCategory
{
    Validation,
    Domain,
    Technical,
    Timeout,
    Connectivity,
    Safety,
    Concurrency,
    Unexpected
}

public sealed record ErrorDetail(
    string Code,
    string Message,
    ErrorCategory Category,
    string? OperatorMessage = null,
    IReadOnlyDictionary<string, object?>? Metadata = null);

public sealed class Result
{
    private Result(bool isSuccess, IReadOnlyList<ErrorDetail> errors, IReadOnlyList<ErrorDetail> warnings)
    {
        IsSuccess = isSuccess;
        Errors = errors;
        Warnings = warnings;
    }

    public bool IsSuccess { get; }
    public bool IsFailure => !IsSuccess;
    public IReadOnlyList<ErrorDetail> Errors { get; }
    public IReadOnlyList<ErrorDetail> Warnings { get; }

    public static Result Success(params ErrorDetail[] warnings) =>
        new(true, Array.Empty<ErrorDetail>(), warnings);

    public static Result Failure(params ErrorDetail[] errors) =>
        new(false, errors, Array.Empty<ErrorDetail>());
}

public sealed class Result<T>
{
    private Result(bool isSuccess, T? value, IReadOnlyList<ErrorDetail> errors, IReadOnlyList<ErrorDetail> warnings)
    {
        IsSuccess = isSuccess;
        Value = value;
        Errors = errors;
        Warnings = warnings;
    }

    public bool IsSuccess { get; }
    public bool IsFailure => !IsSuccess;
    public T? Value { get; }
    public IReadOnlyList<ErrorDetail> Errors { get; }
    public IReadOnlyList<ErrorDetail> Warnings { get; }

    public static Result<T> Success(T value, params ErrorDetail[] warnings) =>
        new(true, value, Array.Empty<ErrorDetail>(), warnings);

    public static Result<T> Failure(params ErrorDetail[] errors) =>
        new(false, default, errors, Array.Empty<ErrorDetail>());
}

That is still compact, but much more usable.

Example: ValidationResult

Validation often returns multiple problems.

csharp

public sealed record ValidationIssue(
    string Code,
    string Field,
    string Message,
    string? SuggestedFix = null);

public sealed class ValidationResult
{
    private ValidationResult(IReadOnlyList<ValidationIssue> issues)
    {
        Issues = issues;
    }

    public IReadOnlyList<ValidationIssue> Issues { get; }
    public bool IsValid => Issues.Count == 0;

    public static ValidationResult Valid() => new(Array.Empty<ValidationIssue>());

    public static ValidationResult Invalid(params ValidationIssue[] issues) => new(issues);
}

Usage:

csharp

public ValidationResult ValidateRecipe(InspectionRecipe recipe)
{
    var issues = new List<ValidationIssue>();

    if (recipe.ExposureTimeMs <= 0)
        issues.Add(new("Recipe.Exposure.Invalid", "ExposureTimeMs", "Exposure time must be greater than zero."));

    if (string.IsNullOrWhiteSpace(recipe.CameraProfile))
        issues.Add(new("Recipe.CameraProfile.Missing", "CameraProfile", "Camera profile is required."));

    if (recipe.ScanRegions.Count == 0)
        issues.Add(new("Recipe.ScanRegions.Empty", "ScanRegions", "At least one scan region is required."));

    return issues.Count == 0
        ? ValidationResult.Valid()
        : ValidationResult.Invalid(issues.ToArray());
}

Example: StartInspectionResult

Sometimes a dedicated outcome type is even clearer than a generic result.

csharp

public enum StartInspectionStatus
{
    Started,
    Rejected,
    Warning
}

public sealed record StartInspectionOutcome(
    StartInspectionStatus Status,
    string? RunId,
    IReadOnlyList<ErrorDetail> Warnings);

public sealed record MachineCommandError(
    string Code,
    string Message,
    string OperatorMessage,
    bool Retryable,
    bool SafeToRetry,
    MachineCommandErrorCode ErrorCode);

public enum MachineCommandErrorCode
{
    InvalidState,
    MachineDisconnected,
    Timeout,
    AlarmActive,
    SafetyInterlock,
    RecipeInvalid
}

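Service contract (this version uses the single-parameter Result<T> defined earlier, so failures are reported as ErrorDetail values):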

csharp

Task<Result<StartInspectionOutcome>> StartInspectionAsync(
    InspectionRecipe recipe,
    CancellationToken ct);

Implementation sketch:

csharp

public async Task<Result<StartInspectionOutcome>> StartInspectionAsync(
    InspectionRecipe recipe,
    CancellationToken ct)
{
    var validation = _recipeValidator.ValidateRecipe(recipe);
    if (!validation.IsValid)
    {
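        return Result<StartInspectionOutcome>.Failure(
            new ErrorDetail(
                "Recipe.Invalid",
                "Recipe validation failed.",
                ErrorCategory.Validation,
                "Recipe is invalid. Review highlighted fields.",
                // Keep the individual issues so the UI can highlight the affected fields.
                new Dictionary<string, object?> { ["Issues"] = validation.Issues }));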
    }

    if (!_machineState.CanStartInspection)
    {
        return Result<StartInspectionOutcome>.Failure(
            new ErrorDetail(
                "Machine.InvalidState",
                $"Machine state '{_machineState.Current}' does not allow StartInspection.",
                ErrorCategory.Domain,
                "Machine is not ready to start inspection.",
                new Dictionary<string, object?> { ["MachineState"] = _machineState.Current }));
    }

    try
    {
        var runId = await _machineAdapter.StartInspectionAsync(recipe, ct);
        return Result<StartInspectionOutcome>.Success(
            new StartInspectionOutcome(StartInspectionStatus.Started, runId, Array.Empty<ErrorDetail>()));
    }
    catch (OperationCanceledException)
    {
        throw;
    }
    catch (TimeoutException ex)
    {
        return Result<StartInspectionOutcome>.Failure(
            new ErrorDetail(
                "Machine.Start.Timeout",
                ex.Message,
                ErrorCategory.Timeout,
                "Machine did not respond in time. Retry after checking connection."));
    }
}

Example: SaveImageResult with warnings and partial success

csharp

public sealed record SaveImageOutcome(
    string ImageId,
    string MainPath,
    string? ThumbnailPath,
    bool ThumbnailSaved);

public async Task<Result<SaveImageOutcome>> SaveImageAsync(
    CapturedImage image,
    CancellationToken ct)
{
    var warnings = new List<ErrorDetail>();

    string mainPath;
    try
    {
        mainPath = await _imageStore.SaveMainImageAsync(image, ct);
    }
    catch (IOException ex)
    {
        return Result<SaveImageOutcome>.Failure(
            new ErrorDetail(
                "ImageSave.Main.Failed",
                ex.Message,
                ErrorCategory.Technical,
                "Failed to save image data.",
                new Dictionary<string, object?> { ["ImageId"] = image.Id }));
    }

    string? thumbnailPath = null;
    bool thumbnailSaved = false;

    try
    {
        thumbnailPath = await _imageStore.SaveThumbnailAsync(image, ct);
        thumbnailSaved = true;
    }
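    // Let cancellation propagate; only real thumbnail faults become warnings.
    catch (Exception ex) when (ex is not OperationCanceledException)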
    {
        warnings.Add(new ErrorDetail(
            "ImageSave.Thumbnail.Failed",
            ex.Message,
            ErrorCategory.Technical,
            "Preview thumbnail could not be generated.",
            new Dictionary<string, object?> { ["ImageId"] = image.Id }));
    }

    return Result<SaveImageOutcome>.Success(
        new SaveImageOutcome(image.Id, mainPath, thumbnailPath, thumbnailSaved),
        warnings.ToArray());
}

This is a realistic production pattern: core operation succeeded, but some secondary work degraded.

Example: WorkflowStepResult

csharp

public enum WorkflowStepStatus
{
    Completed,
    Skipped,
    Failed,
    CompletedWithWarnings
}

public sealed record WorkflowStepResult(
    string StepName,
    WorkflowStepStatus Status,
    IReadOnlyList<ErrorDetail> Errors,
    IReadOnlyList<ErrorDetail> Warnings,
    TimeSpan Duration);

This is much better than bool ExecuteStep() for workflow orchestration.
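
A step result like this is easiest to trust when the status is derived from the collected errors and warnings rather than set by hand. The factory below is a sketch under that assumption; WorkflowStepResults.From is a hypothetical helper, ErrorDetail is reduced to two fields so the example stands alone, and Skipped is left for the caller to set explicitly.

```csharp
using System;
using System.Collections.Generic;

public enum WorkflowStepStatus { Completed, Skipped, Failed, CompletedWithWarnings }

// Simplified stand-in for the richer ErrorDetail record.
public sealed record ErrorDetail(string Code, string Message);

public sealed record WorkflowStepResult(
    string StepName,
    WorkflowStepStatus Status,
    IReadOnlyList<ErrorDetail> Errors,
    IReadOnlyList<ErrorDetail> Warnings,
    TimeSpan Duration);

public static class WorkflowStepResults
{
    // Derives the status from the lists so the two can never disagree
    // (for example "Completed" with a non-empty error list).
    public static WorkflowStepResult From(
        string stepName,
        IReadOnlyList<ErrorDetail> errors,
        IReadOnlyList<ErrorDetail> warnings,
        TimeSpan duration)
    {
        var status = errors.Count > 0 ? WorkflowStepStatus.Failed
            : warnings.Count > 0 ? WorkflowStepStatus.CompletedWithWarnings
            : WorkflowStepStatus.Completed;

        return new WorkflowStepResult(stepName, status, errors, warnings, duration);
    }
}
```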

6. Domain errors vs technical errors

This separation is crucial.

A machine operator does not care about a native HRESULT or driver stack location. A support engineer does. A developer cares even more.

If you mix these levels, you get one of two bad outcomes:

users see meaningless technical messages
logs lose the technical context needed for diagnosis

Example: vendor SDK throws native exception

Suppose the camera SDK throws this:

csharp

VendorCameraException: DeviceOpen failed. Error 0x889A0001. NodeMap unavailable.

That should not leak directly to the operator UI.

At the machine adapter boundary, translate it into a domain-relevant or application-relevant fault.

csharp

public async Task<Result<CameraSession>> OpenCameraAsync(CancellationToken ct)
{
    try
    {
        var handle = await _sdk.OpenAsync(ct);
        return Result<CameraSession>.Success(new CameraSession(handle));
    }
    catch (VendorCameraException ex) when (ex.Code == VendorCameraErrorCodes.DeviceBusy)
    {
        _logger.LogWarning(ex,
            "Camera open failed because device is busy. CameraId={CameraId}", _cameraId);

        return Result<CameraSession>.Failure(
            new ErrorDetail(
                "Camera.Unavailable",
                ex.Message,
                ErrorCategory.Connectivity,
                "Camera is not available. Check connection and whether another process is using it.",
                new Dictionary<string, object?>
                {
                    ["CameraId"] = _cameraId,
                    ["VendorCode"] = ex.Code
                }));
    }
}

Why this separation matters

Because different layers need different language.

Infrastructure layer:

precise technical details
exception types
error codes from external dependency

Application layer:

meaningful categories
retryability
operator-safe wording
business impact

UI layer:

operator action guidance
severity
recoverability
maybe translated/localized message

Logs and telemetry:

full technical context
correlation id
adapter name
step name
external code
stack trace if applicable

Good systems preserve all of these without confusing them.

7. Partial failure and degraded operation

Real systems rarely fail in perfectly binary ways.

Modeling partial success

Suppose inspection finishes, core measurements are valid, but 7 thumbnails fail to save due to disk pressure. You need a model that can say:

workflow completed
result set is valid
some non-critical artifacts are missing
warnings should be visible
support should be able to trace exactly what degraded

A bool cannot express that.

A realistic aggregate result

csharp

public sealed record InspectionCompletionResult(
    string RunId,
    bool CoreResultsSaved,
    int ImagesCaptured,
    int MainImagesSaved,
    int ThumbnailsSaved,
    bool TelemetryUploaded,
    IReadOnlyList<ErrorDetail> Warnings,
    IReadOnlyList<ErrorDetail> Errors)
{
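    // Core results saved and nothing failed.
    public bool IsSuccess => CoreResultsSaved && Errors.Count == 0;
    // Fully successful, but with something worth surfacing.
    public bool IsCompletedWithWarnings => IsSuccess && Warnings.Count > 0;
    // Core results are safe, but at least one non-core operation failed.
    public bool IsPartialSuccess => CoreResultsSaved && Errors.Count > 0;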
}

This kind of result tells the truth better.

Workflows that continue with warnings

A good orchestrator should know which failures are non-fatal.

csharp

foreach (var image in capturedImages)
{
    var saveResult = await _imageSaver.SaveImageAsync(image, ct);

    if (saveResult.IsFailure)
    {
        if (IsCritical(saveResult.Errors))
        {
            return AbortRun("Critical image persistence failure.", saveResult.Errors);
        }

        warnings.AddRange(saveResult.Errors);
        continue;
    }

    warnings.AddRange(saveResult.Warnings);
}

This is explicit. It is readable. It matches operational reality.

Degraded modes

Sometimes the system should enter degraded mode intentionally.

Examples:

telemetry stream unavailable → continue without live dashboard
optional analytics engine offline → continue with core inspection only
secondary image annotation service down → continue and mark post-processing incomplete

Model this as a first-class state, not an accidental afterthought.

csharp

public enum OperationMode
{
    Full,
    Degraded,
    SafeStop
}

Then your workflow state can include mode and reasons.
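
A minimal sketch of such a state, assuming a hypothetical WorkflowModeState record that keeps the current mode together with every reason that led to it:

```csharp
using System;
using System.Collections.Generic;

public enum OperationMode { Full, Degraded, SafeStop }

// Hypothetical state holder: the current mode plus the reasons behind it.
public sealed record WorkflowModeState(OperationMode Mode, IReadOnlyList<string> Reasons)
{
    public static WorkflowModeState FullOperation { get; } =
        new(OperationMode.Full, Array.Empty<string>());

    // Entering degraded mode never downgrades an existing SafeStop.
    public WorkflowModeState EnterDegraded(string reason) =>
        new(Mode == OperationMode.SafeStop ? OperationMode.SafeStop : OperationMode.Degraded,
            Append(reason));

    public WorkflowModeState EnterSafeStop(string reason) =>
        new(OperationMode.SafeStop, Append(reason));

    // Returns a new list so earlier states stay unchanged.
    private IReadOnlyList<string> Append(string reason) =>
        new List<string>(Reasons) { reason };
}
```

Because each transition returns a new state, the UI and the log can both read the same mode and reason list without racing against later changes.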

Collecting multiple errors instead of failing immediately

Validation is the obvious example, but workflows also benefit from aggregation in the right places.

For example, during shutdown:

motion stop failed on axis A
telemetry flush failed
one result file remained locked

You may want to collect all issues, not stop after the first, because shutdown diagnostics matter.

The key is to aggregate where it improves operator action or supportability, and fail fast where safety or correctness requires it.
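
For the shutdown case, aggregation could be sketched as below. ShutdownRunner and ShutdownIssue are hypothetical names; catching every exception is a deliberate choice for best-effort cleanup here and would not be appropriate in the middle of a run.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public sealed record ShutdownIssue(string Step, string Message);

public static class ShutdownRunner
{
    // Runs every shutdown step even if earlier ones fail, and returns all
    // failures together so the diagnostics are complete.
    public static async Task<IReadOnlyList<ShutdownIssue>> RunAllAsync(
        IEnumerable<(string Name, Func<Task> Action)> steps)
    {
        var issues = new List<ShutdownIssue>();

        foreach (var (name, action) in steps)
        {
            try
            {
                await action();
            }
            catch (Exception ex)
            {
                // Record and keep going: the remaining steps must still run.
                issues.Add(new ShutdownIssue(name, ex.Message));
            }
        }

        return issues;
    }
}
```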

8. Error propagation across layers

This is where mature design shows up.

A failure should not flow unchanged through every layer. It should be translated at boundaries so each layer sees what it needs.

A practical layered view

Infrastructure layer

Deals with:

SDK exceptions
IO exceptions
database failures
socket issues
serialization failures

This layer often catches low-level exceptions only when it can add context or translate meaningfully. Otherwise it may let them bubble.

Machine adapter layer

Converts vendor-specific behavior into machine-relevant outcomes.

It knows that:

vendor code 1042 means device busy
timeout during command acknowledgement likely means lost communication
certain faults are retryable
certain faults should map to operator-facing machine states

Application/workflow layer

Decides:

stop run or continue
warning or hard failure
retry or escalate
update run state
surface alarm
record audit event

UI/ViewModel layer

Decides:

what the operator sees
whether to disable buttons
whether to show modal error, banner, status line, or alarm panel
whether technical details are hidden or available in diagnostics screen

Example flow

Vendor SDK throws timeout:

csharp

TimeoutException("Command ACK not received within 1500 ms")

Machine adapter translates:

csharp

new ErrorDetail(
    "Machine.Command.Timeout",
    "Command ACK not received within 1500 ms",
    ErrorCategory.Timeout,
    "Machine did not respond in time.")

Workflow layer evaluates:

if this is a homing command, stop workflow
if this is optional light-control refresh, retry once and continue if safe

UI layer shows:

“Machine did not respond. Check machine connection and retry.”

Logging layer records:

command name
timeout duration
machine state
correlation id
adapter operation
original exception

That is good boundary translation.

Where to catch, where to rethrow, where to convert

Catch when:

you can add important context
you can translate to a meaningful domain/application result
you can decide recovery or fallback
you can preserve safety

Rethrow or allow bubbling when:

the layer cannot handle it meaningfully
it represents a programming/invariant failure
translation would only hide important technical truth

Convert to result when:

the caller is expected to branch on it
the failure is part of normal operation
you want explicit contract-driven handling

9. Async, pipelines, and failure contracts

Async code makes failure easier to lose.

That is one of the biggest real-world dangers.

Failure modeling in async methods

In an async method, a thrown exception is captured in the returned Task and rethrown when that task is awaited. That is useful, but also dangerous because it tempts teams to use exceptions for everything.

A good rule:

expected outcomes: model explicitly in the returned result
unexpected faults: let exceptions fault the task

Example:

csharp

public async Task<Result<InspectionFrame>> TryAcquireFrameAsync(CancellationToken ct)
{
    if (!_machineState.IsAcquisitionReady)
    {
        return Result<InspectionFrame>.Failure(
            new ErrorDetail(
                "Acquire.InvalidState",
                "Machine is not ready for acquisition.",
                ErrorCategory.Domain,
                "Machine is not ready to capture images."));
    }

    var frame = await _camera.AcquireAsync(ct); // unexpected SDK failures can still throw
    return Result<InspectionFrame>.Success(frame);
}

Failure propagation in Task-based flows

In orchestrations, it must be clear which failures:

fault the whole task
return as expected results
are aggregated into warnings
trigger cancellation of sibling operations

Without that clarity, async flows become impossible to reason about.
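
The four cases can be made concrete with a linked CancellationTokenSource. In this sketch, which assumes one critical and one optional operation and uses hypothetical names throughout, a critical fault cancels the sibling and faults the whole task, while an optional fault is returned as a warning string.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class RunOrchestration
{
    // Returns null on a clean run, or a warning message if only the
    // optional operation failed. A critical failure propagates as an exception.
    public static async Task<string?> RunAsync(
        Func<CancellationToken, Task> critical,
        Func<CancellationToken, Task> optional,
        CancellationToken ct)
    {
        using var linked = CancellationTokenSource.CreateLinkedTokenSource(ct);

        var optionalTask = ObserveOptionalAsync(optional, linked.Token);
        try
        {
            await critical(linked.Token);
        }
        catch
        {
            linked.Cancel();     // stop the sibling before propagating
            await optionalTask;  // does not throw; see ObserveOptionalAsync
            throw;
        }

        return await optionalTask;
    }

    private static async Task<string?> ObserveOptionalAsync(
        Func<CancellationToken, Task> optional, CancellationToken ct)
    {
        try
        {
            await optional(ct);
            return null;
        }
        catch (OperationCanceledException)
        {
            return null; // cancelled by the orchestrator; not a warning
        }
        catch (Exception ex)
        {
            return ex.Message; // degraded, but the run continues
        }
    }
}
```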

Channel/pipeline stage failure handling

In streaming pipelines, failures often happen inside background consumers:

image save loop
analytics stage
telemetry stage
result export stage

If a background loop throws and nobody observes it, the system may continue in a broken state silently.

That is extremely dangerous.

Example: hidden background save loop failure

Bad:

csharp

// Fire-and-forget: the Task is discarded, so nobody ever observes a fault.
_ = Task.Run(async () =>
{
    await foreach (var image in _channel.Reader.ReadAllAsync(ct))
    {
        await _imageSaver.SaveImageAsync(image, ct);
    }
});

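If that task faults, the exception is stored in a Task that nobody awaits, so the workflow may never know. By default the process keeps running, and the fault surfaces, if at all, only through TaskScheduler.UnobservedTaskException when the task is garbage collected.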

Better:

csharp

private Task? _saveLoopTask;

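public void StartSaveLoop(CancellationToken ct)
{
    // Keep the task so the orchestrator can await it and observe faults.
    // Task.Run keeps the loop off the calling (possibly UI) thread.
    _saveLoopTask = Task.Run(() => RunSaveLoopAsync(ct), ct);
}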

private async Task RunSaveLoopAsync(CancellationToken ct)
{
    await foreach (var image in _channel.Reader.ReadAllAsync(ct))
    {
        var result = await _imageSaver.SaveImageAsync(image, ct);

        if (result.IsFailure)
        {
            if (IsCritical(result.Errors))
            {
                throw new SavePipelineCriticalException(result.Errors);
            }

            _warningSink.Report(result.Errors);
        }

        if (result.Warnings.Count > 0)
            _warningSink.Report(result.Warnings);
    }
}

Then the orchestrator explicitly observes the task:

csharp

try
{
    await _saveLoopTask!;
}
catch (SavePipelineCriticalException ex)
{
    _logger.LogError(ex, "Image save loop failed critically.");
    await StopRunSafelyAsync();
    throw;
}

Partial pipeline failure vs full workflow cancellation

This is a key design choice.

Examples:

thumbnail stage fails → continue
core result persistence fails → cancel run
monitoring loop throws → maybe switch to degraded mode and raise alarm
PLC heartbeat lost → stop run immediately

The orchestrator should own this policy. Not every stage should decide alone.
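
One way for the orchestrator to own that policy is a single lookup table that stages never consult themselves. The stage names, the StageFailurePolicy type, and the conservative default for unknown stages are all assumptions made for this sketch.

```csharp
using System.Collections.Generic;

public enum StageFailureAction { Continue, EnterDegradedMode, CancelRun, StopImmediately }

// Hypothetical policy object held by the orchestrator only.
public sealed class StageFailurePolicy
{
    private readonly IReadOnlyDictionary<string, StageFailureAction> _actions;

    public StageFailurePolicy(IReadOnlyDictionary<string, StageFailureAction> actions) =>
        _actions = actions;

    // Stages without an entry fall back to the most conservative action.
    public StageFailureAction For(string stageName) =>
        _actions.TryGetValue(stageName, out var action)
            ? action
            : StageFailureAction.StopImmediately;

    // Mirrors the examples above; the stage names are illustrative.
    public static StageFailurePolicy Default { get; } = new(
        new Dictionary<string, StageFailureAction>
        {
            ["Thumbnail"] = StageFailureAction.Continue,
            ["CoreResultPersistence"] = StageFailureAction.CancelRun,
            ["Monitoring"] = StageFailureAction.EnterDegradedMode,
            ["PlcHeartbeat"] = StageFailureAction.StopImmediately,
        });
}
```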

Why hidden async failures are dangerous

Because the UI may still show “running,” but part of the system is dead.

That is worse than a visible crash. It is silent corruption of operational truth.

10. How we use this in .NET in practice

Here is the practical model I would recommend for many production .NET desktop systems.

Use exceptions for truly exceptional or unexpected failures

Examples:

code bugs
invariant violations
unexpected third-party crashes
impossible state transitions
misuse of internal API contracts

Use Result-like types for expected outcomes

Examples:

validation
command rejection
unavailable-but-handled machine state
partial success
warnings
skip/continue decisions

Map low-level faults into meaningful application errors

At boundaries, convert technical exceptions into application-relevant or domain-relevant outcomes where appropriate.
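
A minimal sketch of such a boundary mapping, under stated assumptions: `AppError` is a simplified stand-in for whatever error type the codebase uses, `VendorLinkLostException` is a made-up placeholder for a vendor SDK exception, and `MachineFaultMapper` is a hypothetical helper. Only the error codes come from this article.

```csharp
#nullable enable
using System;

// Hypothetical, simplified error shape used only in this sketch.
public sealed record AppError(string Code, string SafeMessage, string TechnicalDetail);

// Hypothetical stand-in for an exception type thrown by a vendor SDK.
public sealed class VendorLinkLostException : Exception
{
    public VendorLinkLostException(string message) : base(message) { }
}

public static class MachineFaultMapper
{
    // Converts a known low-level exception into an application-level error.
    // Returns null for anything unrecognized so that it keeps propagating
    // as an exception (it is probably a bug or a broken assumption).
    public static AppError? TryMap(Exception ex) => ex switch
    {
        TimeoutException => new AppError(
            "Machine.Command.Timeout",
            "The machine did not respond in time.",
            ex.ToString()),
        VendorLinkLostException => new AppError(
            "Camera.Unavailable",
            "The camera is not available.",
            ex.ToString()),
        _ => null
    };
}
```

An adapter would call `TryMap` inside its catch block, return a failed result when it gets an error back, and rethrow when it gets null. The technical detail stays available for logs while the safe message is what higher layers may show.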

Carry error codes and safe messages

Have stable codes. Codes matter for support, automation, and observability.

Examples:

Recipe.Invalid
Machine.InvalidState
Machine.Command.Timeout
Camera.Unavailable
ImageSave.Thumbnail.Failed
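
Stable codes are easier to keep stable when they live in one place instead of being repeated as string literals. The following is a sketch, not a prescribed layout: the `ErrorCodes` class and the convention that the segment before the first dot is the category are assumptions made for illustration.

```csharp
using System;

// Hypothetical central catalog of the codes listed above.
public static class ErrorCodes
{
    public const string RecipeInvalid = "Recipe.Invalid";
    public const string MachineInvalidState = "Machine.InvalidState";
    public const string MachineCommandTimeout = "Machine.Command.Timeout";
    public const string CameraUnavailable = "Camera.Unavailable";
    public const string ImageSaveThumbnailFailed = "ImageSave.Thumbnail.Failed";

    // Assumption of this sketch: the text before the first dot is the category.
    // A code without a dot is treated as its own category.
    public static string CategoryOf(string code)
    {
        if (string.IsNullOrWhiteSpace(code))
            throw new ArgumentException("Error code must not be empty.", nameof(code));

        var dot = code.IndexOf('.');
        return dot < 0 ? code : code.Substring(0, dot);
    }
}
```

Call sites then compare against `ErrorCodes.MachineCommandTimeout` rather than a literal, and a rename becomes a compile-time change instead of a search through strings.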

Design APIs with explicit failure contracts

Some practical examples:

csharp

public interface IRecipeValidator
{
    ValidationResult Validate(InspectionRecipe recipe);
}

public interface IMachineCommandService
{
    Task<Result<StartInspectionOutcome>> StartInspectionAsync(
        InspectionRecipe recipe,
        CancellationToken ct);

    Task<Result> StopInspectionAsync(CancellationToken ct);
}

public interface IImagePersistenceService
{
    Task<Result<SaveImageOutcome>> SaveImageAsync(
        CapturedImage image,
        CancellationToken ct);
}

public interface IWorkflowStep
{
    Task<WorkflowStepResult> ExecuteAsync(WorkflowContext context, CancellationToken ct);
}

This is much clearer than a mix of bool, void, random exceptions, and event-based side channels.

A more complete example

csharp

using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Logging;

public sealed class InspectionWorkflow
{
    private readonly IMachineCommandService _machine;
    private readonly IImagePersistenceService _imagePersistence;
    private readonly ILogger<InspectionWorkflow> _logger;

    public InspectionWorkflow(
        IMachineCommandService machine,
        IImagePersistenceService imagePersistence,
        ILogger<InspectionWorkflow> logger)
    {
        _machine = machine;
        _imagePersistence = imagePersistence;
        _logger = logger;
    }

    public async Task<Result<InspectionCompletionResult>> RunAsync(
        InspectionRecipe recipe,
        IReadOnlyList<CapturedImage> images,
        CancellationToken ct)
    {
        var warnings = new List<ErrorDetail>();
        var errors = new List<ErrorDetail>();

        // Expected rejection (for example, wrong machine state): pass the errors through.
        var startResult = await _machine.StartInspectionAsync(recipe, ct);
        if (startResult.IsFailure)
        {
            return Result<InspectionCompletionResult>.Failure(startResult.Errors.ToArray());
        }

        // Count what was actually saved. Deriving these numbers from
        // images.Count is wrong once the loop breaks early, because the
        // remaining images were never attempted.
        var mainImagesSaved = 0;
        var thumbnailsSaved = 0;

        foreach (var image in images)
        {
            var saveResult = await _imagePersistence.SaveImageAsync(image, ct);

            if (saveResult.IsFailure)
            {
                // Losing a main image is critical: stop saving and fail the run.
                if (saveResult.Errors.Any(e => e.Code == "ImageSave.Main.Failed"))
                {
                    errors.AddRange(saveResult.Errors);
                    break;
                }

                // Any other failure is downgraded to a warning.
                // The image is not counted as saved.
                warnings.AddRange(saveResult.Errors);
                continue;
            }

            mainImagesSaved++;
            if (!saveResult.Warnings.Any(w => w.Code == "ImageSave.Thumbnail.Failed"))
                thumbnailsSaved++;

            warnings.AddRange(saveResult.Warnings);
        }

        var completion = new InspectionCompletionResult(
            RunId: startResult.Value!.RunId!,
            CoreResultsSaved: errors.Count == 0,
            ImagesCaptured: images.Count,
            MainImagesSaved: mainImagesSaved,
            ThumbnailsSaved: thumbnailsSaved,
            // Placeholder: this workflow has no telemetry step; a real one
            // would set this from the result of that step.
            TelemetryUploaded: true,
            Warnings: warnings,
            Errors: errors);

        // Note: on failure only the errors are returned; the summary is dropped.
        if (errors.Count > 0)
            return Result<InspectionCompletionResult>.Failure(errors.ToArray());

        return Result<InspectionCompletionResult>.Success(completion, warnings.ToArray());
    }
}

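This is closer to production code than a toy example: expected rejections are returned rather than thrown, a critical save failure stops the run, and non-critical failures are collected as warnings.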

11. Common mistakes

These mistakes are very common because teams often evolve failure handling reactively.

Throwing exceptions for normal validation failures

Why it happens:

easy at first
framework culture sometimes encourages exception-first style
teams do not distinguish expected vs unexpected failure

What it causes:

noisy logs
harder control flow
awkward UI handling
validation treated like a crash path

Validation is usually not exceptional. It is an expected branch.

Swallowing exceptions and returning generic “failed”

Why it happens:

fear of crashes
rushed defensive coding
desire to “keep system running”

Example:

csharp

catch (Exception)
{
    return false;
}

What it causes:

lost diagnostic detail
impossible support investigation
hidden severity
meaningless UI messaging

This is one of the worst patterns in production code.
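
For contrast, here is a sketch of a catch site that keeps the information. It catches only the failures it expects, attaches a stable code, preserves the original exception for logging, and lets everything else propagate. `RecipeFileReader`, `RecipeFileOutcome`, and the two `Recipe.File.*` codes are hypothetical names invented for this example.

```csharp
#nullable enable
using System;
using System.IO;

// Hypothetical outcome type: success carries the text,
// failure carries a code plus the original exception.
public sealed record RecipeFileOutcome(
    bool Succeeded,
    string? Text,
    string? ErrorCode,
    Exception? Cause);

public static class RecipeFileReader
{
    public static RecipeFileOutcome Read(string path)
    {
        try
        {
            return new RecipeFileOutcome(true, File.ReadAllText(path), null, null);
        }
        catch (FileNotFoundException ex)
        {
            // Expected: the operator picked a recipe that no longer exists.
            return new RecipeFileOutcome(false, null, "Recipe.File.NotFound", ex);
        }
        catch (UnauthorizedAccessException ex)
        {
            // Expected: the file exists but this account may not read it.
            return new RecipeFileOutcome(false, null, "Recipe.File.AccessDenied", ex);
        }
        // No catch (Exception): anything unexpected keeps propagating
        // with its stack trace intact.
    }
}
```

The caller can still branch on a simple success flag, but support now has a code to search for and the log still has the original exception.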

Returning bool with no reason

Why it happens:

simplicity
legacy habits
trying to avoid complexity

What it causes:

opaque contracts
caller confusion
side-channel dependency
inconsistent user messaging
poor observability

bool is often too weak for important operations.

Leaking low-level technical errors directly to UI

Why it happens:

shortcut from catch block to message box
no translation layer
internal exception text used as user communication

What it causes:

operator confusion
frightening or meaningless messages
poor UX
accidental exposure of irrelevant technical detail
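
A translation layer does not have to be large. The sketch below maps stable error codes to operator wording and falls back to a generic message that quotes the code. The `OperatorMessages` class and most of the wording are illustrative assumptions; only the codes and the thumbnail message are taken from this article.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical lookup from stable error codes to operator-safe wording.
public static class OperatorMessages
{
    private static readonly IReadOnlyDictionary<string, string> ByCode =
        new Dictionary<string, string>
        {
            ["Recipe.Invalid"] = "The recipe contains invalid values. Please review it.",
            ["Machine.InvalidState"] = "The machine is not ready for this command.",
            ["Machine.Command.Timeout"] = "The machine did not respond in time.",
            ["Camera.Unavailable"] = "The camera is not available.",
            ["ImageSave.Thumbnail.Failed"] =
                "Inspection completed with warnings. Preview images missing.",
        };

    // Never returns exception text. Unknown codes get a generic message
    // that still lets support find the matching log entry.
    public static string For(string errorCode)
    {
        if (errorCode is null) throw new ArgumentNullException(nameof(errorCode));

        return ByCode.TryGetValue(errorCode, out var message)
            ? message
            : "The operation could not be completed. Support code: " + errorCode;
    }
}
```

The view model asks `OperatorMessages.For(error.Code)` for display text and sends the full error, including any exception, to the log.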

Mixing domain failures and technical failures together

Why it happens:

no error taxonomy
ad hoc custom exceptions
lack of architecture ownership

What it causes:

retry logic becomes unreliable
workflow stop/continue decisions become inconsistent
hard-to-read code

Inconsistent result styles across the codebase

Examples:

some methods throw
some return bool
some return null
some return tuples
some use custom Result
some signal failures by events

This is chaos.

A large system needs conventions.

Hiding async/background failures

Why it happens:

fire-and-forget tasks
unobserved pipeline consumer faults
background services without supervision

What it causes:

silent data loss
stale UI state
partial dead system behavior
very long debugging cycles
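
A small wrapper can make supervision the default for background work. This is a sketch under assumptions: `SupervisedTask` is a hypothetical helper, and `onFault` stands for whatever the system uses to surface a failure (an alarm, a status flag, a log entry).

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper: background work whose faults are always reported.
public static class SupervisedTask
{
    // Runs the work and reports any fault through onFault.
    // The returned task still faults, so a caller that awaits it
    // sees the same exception.
    public static async Task RunAsync(
        Func<CancellationToken, Task> work,
        Action<Exception> onFault,
        CancellationToken ct)
    {
        try
        {
            await work(ct).ConfigureAwait(false);
        }
        catch (OperationCanceledException) when (ct.IsCancellationRequested)
        {
            // A requested shutdown is not a fault; complete quietly.
        }
        catch (Exception ex)
        {
            onFault(ex); // surface the failure even if nobody awaits the task
            throw;       // keep the task faulted for callers that do await it
        }
    }
}
```

Even when the returned task is only stored and awaited much later, the fault is reported as soon as it happens, which keeps the UI state from claiming that a dead loop is still running.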

No structured error codes or categories

Why it happens:

teams rely on free-form strings
support needs were not considered up front

What it causes:

impossible reporting aggregation
weak support playbooks
no stable contract for telemetry or alarm routing

12. Trade-offs

There is no free design.

Simplicity vs explicitness

A bool return is simple. A rich result is explicit.

The right choice depends on the importance and variability of failure.

For critical machine/workflow operations, explicitness usually wins.

Exception-based flow vs Result-based flow

Exception flow is concise for rare abnormal cases. Result flow is clearer for expected branching.

Use each where it fits. Overusing either creates pain.

Rich error models vs complexity

A very rich model can become heavy:

too many types
too much wrapping
too much mapping code

A very weak model becomes ambiguous.

Experienced engineers aim for enough structure to preserve meaning, but not so much that every method becomes ceremony.

Preserving detail vs keeping APIs readable

Not every result needs twenty fields.

Keep the surface contract readable:

code
category
operator-safe message
maybe metadata
warnings/errors collection where needed

Deeper technical detail can stay in logs or diagnostic context.

Consistency across system vs local optimization

One team may want a custom result per feature. Another wants one universal result type. Both extremes can be awkward.

Usually a good compromise is:

a shared base error/result model
specialized result payloads where the domain needs them
documented conventions for when to throw vs return result

That gives consistency without flattening everything.

13. Designing good failure contracts

A good failure contract tells the truth about the operation.

What makes a good failure contract

It should tell the caller:

what a successful outcome looks like
what expected failures look like
whether partial success exists
whether warnings can be returned
whether exceptions still represent unexpected faults
what the caller is expected to do

How callers know what to expect

The contract should be visible in the signature and naming.

Bad:

csharp

Task<bool> ExecuteAsync();

Better:

csharp

Task<WorkflowStepResult> ExecuteAsync(WorkflowContext context, CancellationToken ct);

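This is much better: the return type names the outcome, and the context and cancellation token are explicit parameters.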

When an API should force explicit handling

If the failure is business-significant, the API should make it hard to ignore.

Validation is a good example.

csharp

ValidationResult Validate(InspectionRecipe recipe);

This forces the caller to inspect validity and issues.

A machine start command is another example.

csharp

Task<Result<StartInspectionOutcome>> StartInspectionAsync(
    InspectionRecipe recipe,
    CancellationToken ct);

The caller can no longer pretend that failure is just “maybe false.”

Examples

Machine service API

csharp

public interface IMachineService
{
    Task<Result<MachineStatusSnapshot>> GetStatusAsync(CancellationToken ct);
    Task<Result<StartInspectionOutcome>> StartInspectionAsync(InspectionRecipe recipe, CancellationToken ct);
    Task<Result> StopAsync(CancellationToken ct);
}

Expected command failures are explicit.

Workflow step API

csharp

public interface IWorkflowStep
{
    Task<WorkflowStepResult> ExecuteAsync(WorkflowContext context, CancellationToken ct);
}

Better than exceptions for every skip, warning, or rejection.

Validation API

csharp

public interface IRecipeValidator
{
    ValidationResult Validate(InspectionRecipe recipe);
}

Do not make validation throw for normal invalid input.

Save pipeline API

csharp

public interface IImageSaver
{
    Task<Result<SaveImageOutcome>> SaveAsync(CapturedImage image, CancellationToken ct);
}

This supports partial success and warnings naturally.

14. Debugging and observability of result/field failures

One of the operational benefits of structured failure modeling is faster diagnosis.

How structured failure models help production debugging

If errors are modeled with codes and categories, support can quickly answer:

what happened
where it happened
how often it happens
which failures are operator errors vs system faults
which are retryable vs fatal

That is much better than searching logs for text fragments.

How error codes and categories improve supportability

Examples:

Recipe.Invalid
Machine.InvalidState
Machine.Command.Timeout
ImageSave.Thumbnail.Failed
Camera.Unavailable

These codes can drive:

dashboards
alert thresholds
support runbooks
alarm classifications
trend analysis
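
Because the codes are stable strings, counting and trending them is a plain grouping operation. The sketch below assumes a hypothetical `FailureEvent` record and `FailureStats` helper; a real system would read such events from its log or telemetry store.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical record of one reported failure or warning.
public sealed record FailureEvent(string RunId, string Code, DateTimeOffset Timestamp);

public static class FailureStats
{
    // Counts how often each error code occurred.
    public static IReadOnlyDictionary<string, int> CountByCode(
        IEnumerable<FailureEvent> events) =>
        events
            .GroupBy(e => e.Code)
            .ToDictionary(g => g.Key, g => g.Count());
}
```

The same grouping with free-form message strings would split one failure into many buckets whenever the wording or an embedded value changed.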

Correlating operator-visible failures with logs and telemetry

A strong pattern is to include:

operation/run id
error code
step name
machine id
recipe id
timestamp
correlation id

The operator may see:

Inspection completed with warnings. Preview images missing.

The log/telemetry can show:

RunId: R-20260417-1422
WarningCode: ImageSave.Thumbnail.Failed
Count: 12
Node: IPC-03
DiskFreeMB: 142
Step: ThumbnailGenerator

That correlation sharply reduces MTTR because support can go from symptom to cause much faster.

How experienced engineers use failure modeling to reduce MTTR

They design errors not just for code correctness, but for operations.

They ask:

can support distinguish operator misuse from machine fault?
can we count and trend this failure?
can we tell whether degraded mode was entered?
can we correlate the UI message to a stable code?
can we tell whether retries happened and why?

That is mature engineering.

15. Senior engineer mental model

This is the main shift.

A senior engineer stops thinking of failure as “the thing that happens in catch.” They think of failure as part of the domain model.

In a real system:

some negative outcomes are normal
some are warnings
some are partial success
some require recovery
some require operator action
some should stop immediately
some indicate a system defect

Those differences should appear in the design.

How experienced engineers think about expected outcomes vs true exceptions

They ask:

Is this an expected possibility in normal operation?
Does the caller need to branch on it?
Is it safe or unsafe?
Should it be visible in the signature?
Is it a business/domain outcome or a technical fault?
Does partial success matter here?
What should the operator see?
What should logs and telemetry retain?

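If the outcome is expected in normal operation, the caller needs to branch on it, and it deserves to be visible in the signature, it usually belongs in a result model. If not, it probably belongs in exception flow. The remaining questions then shape what the result carries and who gets to see it.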

How they keep error handling consistent across a large codebase

They establish conventions such as:

exceptions for unexpected/programming/invariant failures
result types for expected domain/application outcomes
validation returns structured validation result
adapter boundaries translate external faults into application-level errors
operator-facing messages never come directly from raw exceptions
background task failures must be observed and surfaced
error codes are stable and structured

This consistency matters more than theoretical purity.

How they design APIs that are honest about failure

Honest APIs tell callers what can happen.

Dishonest APIs hide important outcomes behind:

bool
null
generic exception
side effects
logs only

Good APIs make failure behavior discoverable and predictable.

How they keep failure understandable for both developers and operators

They separate views:

technical detail for developers and logs
meaningful categories for workflows
safe, actionable wording for operators

That separation is one of the marks of production-grade design.

A practical recommendation for interview-level thinking

If I had to summarize the whole topic into one practical rule set for a senior/principal interview, I would say this:

Use exceptions for things that are truly abnormal, unexpected, or represent bugs or broken assumptions.

Use Result-style models for things that are expected parts of business, workflow, machine state, validation, partial success, and warnings.

Translate low-level technical faults into meaningful application/domain failures at boundaries.

Design failure contracts explicitly, especially in long-running workflows, machine commands, and background pipelines.

Preserve enough detail for logs, telemetry, support, and diagnosis, but keep operator messages clear, safe, and actionable.

And above all: do not treat all failures as the same kind of thing. In real systems, failure has shape. Good engineers model that shape clearly.
