
Observability, logging, and debugging in .NET desktop systems

This topic sounds boring until the day a machine fails in front of a customer, the UI looks “stuck,” the inspection workflow stops halfway, and nobody can reproduce it in development.

That is the moment when observability stops being a nice engineering idea and becomes the thing that separates a professional system from a fragile one.

In industrial desktop systems, observability is not just “write some logs.” It is how you understand what the software believed was happening, what the machine was doing, what the operator clicked, which background task failed, and why the workflow ended up in the wrong state.


PART 1 — BIG PICTURE

Why observability is critical in real systems

In a normal business app, a bug may mean a broken page or a failed request.

In a wafer inspection desktop system, a bug may mean:

  • the machine stops mid-run
  • an inspection result is partially saved
  • a camera timeout causes a retry storm
  • the UI shows “Running” while the machine is actually in fault state
  • operators lose confidence because the system behaves unpredictably

These systems are hard because many things happen at the same time:

  • UI events
  • machine communication
  • background processing
  • image/result pipelines
  • persistence
  • hardware callbacks
  • workflow orchestration

When something goes wrong, you need to answer questions like:

  • What step was the workflow in?
  • Which command was sent to the machine?
  • Did the machine acknowledge it?
  • Did a timeout happen before or after the response arrived?
  • Did the UI reflect the real state or only some stale state?
  • Was the failure complete, or did only one subsystem fail?

Without observability, you are guessing.

With observability, you can reconstruct the story.

Why debugging production issues is much harder than development

In development, life is friendly:

  • you have the debugger
  • you can step through code
  • timing is slower and cleaner
  • the environment is controlled
  • hardware simulation may be stable
  • logs are easy to inspect locally

In production or in the field, life is very different:

  • the machine is real
  • timing is different
  • the user clicks in unpredictable ways
  • network/device latency varies
  • vendor SDKs behave differently under load
  • failures are intermittent
  • attaching a debugger may be impossible or too risky

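The hardest bugs are usually not logic bugs. They are behavior bugs: the code is correct in isolation, but timing, ordering, load, or the environment makes the running system misbehave.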

Examples:

  • “It only happens once every 3 days.”
  • “It happens only when the operator stops and quickly starts again.”
  • “It fails only under real throughput.”
  • “The machine replied, but the app acted like it timed out.”
  • “The screen froze, but then recovered.”

Those problems are rarely solved by staring at code alone. They are solved by reconstructing real runtime behavior.

Why logs are often the only source of truth in field failures

In field failures, the logs are often the only witness that was actually present.

A customer-reported bug usually comes with poor input:

  • “The app froze”
  • “Inspection failed”
  • “The machine disconnected”
  • “The wrong result appeared”
  • “It worked yesterday”

That is not enough.

A good logging system turns vague complaints into something actionable:

  • run 20260321-104455 started with recipe RCP-12A on machine M-03
  • wafer loaded successfully
  • autofocus command sent
  • machine response delayed 8.2 seconds
  • timeout threshold 5 seconds exceeded
  • workflow entered Recovery state
  • image pipeline still processing last frame
  • UI received stale status event after recovery
  • background task faulted due to disposed channel writer

Now you have a real story.

That is observability: making invisible runtime behavior visible.


PART 2 — HOW IT ACTUALLY WORKS

Structured logging

A lot of teams say they have logging, but what they really have is text dumping.

Bad log:

text
Inspection failed for machine 3 with recipe abc

Looks okay at first. But later, you want to search:

  • all failures for a specific run
  • all failures for machine M-03
  • all failures for recipe ABC-2026
  • all warnings before a specific error
  • average duration of autofocus step across runs

Plain text makes this hard.

Structured logging means you log a message plus named fields.

Conceptual example:

  • Message: Inspection step failed

  • Properties:

    • RunId = "RUN-20260321-001"
    • MachineId = "M-03"
    • Recipe = "ABC-2026"
    • Step = "AutoFocus"
    • DurationMs = 8200
    • ErrorCode = "Timeout"

Now your log backend, or even local file analysis, can filter and group by these fields.

This is a huge difference in production systems. You stop reading logs like novels and start querying them like data.
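To make "querying logs like data" concrete, here is a minimal sketch that answers one of the questions above: the average duration of a step on one machine. It assumes the log events have already been parsed into memory; the `LogEntry` record and the `LogQueries` helper are hypothetical names for illustration, not part of any logging library.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical shape of one parsed structured log event (not a library type).
public sealed record LogEntry(
    DateTimeOffset Timestamp,
    string Level,
    string Message,
    IReadOnlyDictionary<string, object> Properties);

public static class LogQueries
{
    // Average DurationMs for one step on one machine, or null if nothing matches.
    public static double? AverageStepDurationMs(
        IEnumerable<LogEntry> entries, string machineId, string step)
    {
        var durations = entries
            .Where(e => HasValue(e, "MachineId", machineId) && HasValue(e, "Step", step))
            .Where(e => e.Properties.ContainsKey("DurationMs"))
            .Select(e => Convert.ToDouble(e.Properties["DurationMs"]))
            .ToList();

        if (durations.Count == 0)
        {
            return null;
        }

        return durations.Average();
    }

    // True when the entry carries the named property with the expected text value.
    private static bool HasValue(LogEntry entry, string key, string expected) =>
        entry.Properties.TryGetValue(key, out var value)
        && string.Equals(value?.ToString(), expected, StringComparison.Ordinal);
}
```

With plain-text logs the same question needs fragile string parsing. With named fields it is a filter and an average.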

Log levels

Log levels are not just decoration. They are a signal of operational importance.

Information

Used for important normal events.

Examples:

  • inspection run started
  • recipe loaded
  • machine connected
  • workflow state changed
  • run completed

Information logs record the normal flow of the system.

Warning

Used when something is off, but the system can still continue.

Examples:

  • machine response slower than normal
  • retry triggered
  • stale event ignored
  • optional result save failed but run continues
  • fallback path used

Warnings are important because they often explain why an eventual error happened later.

Error

Used for real failures that break an operation or require attention.

Examples:

  • inspection aborted
  • machine command failed
  • unhandled background task exception
  • database write failed for required result data

Errors should be meaningful, not noisy.

Debug / Trace

Used for very detailed internal behavior.

Examples:

  • every message received from device protocol
  • queue depth changes
  • every retry attempt
  • state machine transition checks
  • timing between pipeline stages

Useful when diagnosing deep issues, but dangerous if always enabled at high volume.

Correlation of events across components

In a desktop machine-control system, one user action often triggers work in many layers:

  • UI command
  • workflow service
  • machine controller
  • camera service
  • result processor
  • file persistence
  • event bus
  • background worker

If each component logs independently with no shared context, the logs become nearly useless for reconstructing what happened.

You need correlation.

For one inspection run, every relevant log should carry shared identifiers such as:

  • RunId
  • MachineId
  • LotId
  • WaferId
  • Recipe
  • sometimes SessionId or OperationId

That lets you reconstruct one logical story even though the work spans many classes, tasks, and threads.

Without correlation, your logs are just fragments.

With correlation, they become a timeline.


PART 3 — REAL PROBLEMS IN THIS SYSTEM

Running example: a WPF desktop app controlling a wafer inspection machine.

Tracing an inspection run from start to finish

A real inspection run is not one method call. It is a long conversation between many components inside one process.

Typical flow:

  1. Operator selects recipe
  2. UI requests run start
  3. workflow validates readiness
  4. machine moves to load position
  5. wafer loads
  6. autofocus starts
  7. image acquisition begins
  8. defect pipeline processes frames
  9. results save incrementally
  10. summary generated
  11. workflow completes

If the user says, “Run failed halfway,” you need to know exactly where halfway was.

Good logging lets you see:

  • when the run started
  • which state transitions occurred
  • which hardware commands were issued
  • what the machine returned
  • how long each step took
  • where the first abnormal event appeared

That means you should log the major lifecycle:

  • run created
  • state transitions
  • machine command send/response
  • retries
  • step duration
  • completion/abort reason

Not every tiny internal method. The important story.

Diagnosing machine communication issues

Hardware integration bugs are painful because the software and machine each blame the other.

Typical problems:

  • command sent but no reply
  • reply arrived late
  • malformed reply
  • duplicate response
  • disconnect during operation
  • SDK callback on unexpected thread
  • command acknowledged but machine never changed state

To diagnose this, logs need more than “communication failed.”

You need:

  • command name
  • machine/device id
  • sequence or request id if available
  • timeout threshold
  • actual wait duration
  • raw error code from SDK/protocol
  • connection state before and after
  • whether retry was attempted

For example, there is a big difference between:

  • no response ever arrived
  • response arrived after timeout
  • response arrived but parser failed
  • command succeeded but state poller still saw old state

These sound similar to the operator, but they have very different root causes.
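One way to keep these cases apart in the logs is to classify every command result explicitly instead of logging a generic failure. The following is a minimal sketch; the `CommandOutcome` enum and the classifier inputs are assumptions for illustration, and a real gateway would derive them from its own protocol or SDK.

```csharp
using System;

public enum CommandOutcome
{
    Succeeded,
    NoResponse,         // nothing arrived at all
    LateResponse,       // a reply arrived, but after the timeout
    MalformedResponse,  // a reply arrived in time, but parsing failed
    StateNotApplied     // acknowledged in time, but the polled state never changed
}

public static class CommandOutcomeClassifier
{
    public static CommandOutcome Classify(
        TimeSpan timeout,
        TimeSpan? responseAfter, // null when no reply was ever observed
        bool parsedOk,
        bool stateChanged)
    {
        if (responseAfter is null) return CommandOutcome.NoResponse;
        if (responseAfter.Value > timeout) return CommandOutcome.LateResponse;
        if (!parsedOk) return CommandOutcome.MalformedResponse;
        if (!stateChanged) return CommandOutcome.StateNotApplied;
        return CommandOutcome.Succeeded;
    }
}
```

Logging the outcome as a named field (for example `Outcome=LateResponse`) next to the timeout threshold and actual wait duration makes the four cases searchable separately.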

Debugging race conditions or timing bugs

Timing bugs are the worst kind because the system often “looks correct” in code review.

Examples:

  • stop command races with machine-complete event
  • UI binds to stale view model state
  • background consumer processes an old frame after run cancellation
  • reconnect logic overlaps with active command execution
  • event order differs under load

The only way to understand this is often timeline logging.

You need timestamps and context around:

  • event received
  • state transition requested
  • state transition applied
  • cancellation requested
  • task completed
  • queue item dequeued
  • UI updated

Then you can see the ordering.

For example:

  • 10:15:01.102 Run canceled
  • 10:15:01.110 Image frame received
  • 10:15:01.114 Frame processing started
  • 10:15:01.130 Result publish skipped because run is canceled

This tells a healthy story.

But if you instead see:

  • 10:15:01.102 Run canceled
  • 10:15:01.130 Result publish completed

then you know canceled work still leaked through.
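The same check can be automated once events carry a timestamp and a run id. This sketch assumes events have been parsed into a hypothetical `TimelineEvent` record, and that the caller supplies the event names that count as side effects; none of these names come from a library.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical parsed log event: when it happened, which run, and what happened.
public sealed record TimelineEvent(DateTimeOffset Timestamp, string RunId, string Name);

public static class TimelineChecks
{
    // Returns side-effect events recorded after the run's first cancel event.
    public static IReadOnlyList<TimelineEvent> FindWorkAfterCancel(
        IEnumerable<TimelineEvent> events,
        string runId,
        string cancelEventName,
        ISet<string> sideEffectNames)
    {
        var runEvents = events
            .Where(e => e.RunId == runId)
            .OrderBy(e => e.Timestamp)
            .ToList();

        var cancel = runEvents.FirstOrDefault(e => e.Name == cancelEventName);
        if (cancel is null)
        {
            return Array.Empty<TimelineEvent>();
        }

        return runEvents
            .Where(e => e.Timestamp > cancel.Timestamp && sideEffectNames.Contains(e.Name))
            .ToList();
    }
}
```

An empty result matches the healthy timeline. Any returned event is canceled work that leaked through.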

Understanding partial failures during workflows

Real systems often fail partially, not completely.

Examples:

  • inspection completed, but thumbnail save failed
  • machine moved correctly, but UI never updated
  • main result saved, but defect overlay generation failed
  • live stream disconnected, but inspection continued
  • summary report failed, but raw data exists

If your logs only record final success/failure, you lose the nuance.

A mature system logs per sub-operation and records degradation clearly.

That matters because the recovery action depends on what failed.

  • If only visualization failed, do not re-run the wafer.
  • If raw images are missing, re-run may be required.
  • If save completed but UI showed failure, the operator may need reassurance, not retry.
  • If report generation failed after successful inspection, treat it as post-processing failure, not machine failure.

This is why workflow-level observability matters. It helps separate process failure from subsystem failure.
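A rough sketch of how per-sub-operation outcomes could drive the recovery decision is shown below. The `FailedParts` flags and the policy itself are assumptions for illustration; a real system would define its own parts and rules.

```csharp
using System;

// Hypothetical set of sub-operations that can fail independently within one run.
[Flags]
public enum FailedParts
{
    None = 0,
    RawImages = 1,
    Overlay = 2,
    Thumbnail = 4,
    SummaryReport = 8
}

public enum RecoveryAction
{
    None,               // everything succeeded
    RedoPostProcessing, // inspection data exists; only derived artifacts are missing
    ConsiderRerun       // primary data is missing; a re-run may be required
}

public static class RecoveryPolicy
{
    public static RecoveryAction Decide(FailedParts failed)
    {
        if (failed == FailedParts.None)
        {
            return RecoveryAction.None;
        }

        // Assumption: raw images cannot be rebuilt from anything else.
        if ((failed & FailedParts.RawImages) != 0)
        {
            return RecoveryAction.ConsiderRerun;
        }

        return RecoveryAction.RedoPostProcessing;
    }
}
```

Logging the `FailedParts` value on the run-completed event records the degradation explicitly, so nobody has to infer it from scattered error lines.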


PART 4 — HOW WE USE IT IN .NET (PRACTICAL)

Below are practical patterns using Microsoft.Extensions.Logging.

Structured logging with context

csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Logging;

public sealed class InspectionService
{
    private readonly ILogger<InspectionService> _logger;
    private readonly IMachineController _machineController;

    public InspectionService(
        ILogger<InspectionService> logger,
        IMachineController machineController)
    {
        _logger = logger;
        _machineController = machineController;
    }

    public async Task StartInspectionAsync(
        string runId,
        string machineId,
        string recipe,
        CancellationToken cancellationToken)
    {
        _logger.LogInformation(
            "Inspection run started. RunId={RunId}, MachineId={MachineId}, Recipe={Recipe}",
            runId, machineId, recipe);

        try
        {
            await _machineController.LoadRecipeAsync(machineId, recipe, cancellationToken);

            _logger.LogInformation(
                "Recipe loaded successfully. RunId={RunId}, MachineId={MachineId}, Recipe={Recipe}",
                runId, machineId, recipe);

            await _machineController.StartInspectionAsync(machineId, cancellationToken);

            _logger.LogInformation(
                "Inspection command accepted. RunId={RunId}, MachineId={MachineId}",
                runId, machineId);
        }
        catch (OperationCanceledException)
        {
            _logger.LogWarning(
                "Inspection run canceled. RunId={RunId}, MachineId={MachineId}, Recipe={Recipe}",
                runId, machineId, recipe);
            throw;
        }
        catch (Exception ex)
        {
            _logger.LogError(
                ex,
                "Inspection run failed during startup. RunId={RunId}, MachineId={MachineId}, Recipe={Recipe}",
                runId, machineId, recipe);
            throw;
        }
    }
}

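The important thing here is not the syntax. It is that the log carries reusable context: the named placeholders ({RunId}, {MachineId}, {Recipe}) are captured as separate properties by providers that support structured logging, whereas string interpolation would flatten them into plain text.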

Using scopes for correlated logs

Scopes are very useful in .NET logging when many downstream calls should automatically inherit the same context.

csharp
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Logging;

public sealed class InspectionRunCoordinator
{
    private readonly ILogger<InspectionRunCoordinator> _logger;
    private readonly InspectionWorkflow _workflow;

    public InspectionRunCoordinator(
        ILogger<InspectionRunCoordinator> logger,
        InspectionWorkflow workflow)
    {
        _logger = logger;
        _workflow = workflow;
    }

    public async Task ExecuteRunAsync(
        string runId,
        string machineId,
        string recipe,
        CancellationToken cancellationToken)
    {
        using var scope = _logger.BeginScope(new Dictionary<string, object>
        {
            ["RunId"] = runId,
            ["MachineId"] = machineId,
            ["Recipe"] = recipe
        });

        _logger.LogInformation("Inspection workflow execution started.");

        await _workflow.PrepareAsync(cancellationToken);
        await _workflow.RunAsync(cancellationToken);
        await _workflow.CompleteAsync(cancellationToken);

        _logger.LogInformation("Inspection workflow execution finished successfully.");
    }
}

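Now every log written inside that scope, including logs from loggers further down the call chain, can carry the scope's properties. Scopes flow with the async call context, but whether they appear in the output depends on the logging provider and its configuration; the built-in console provider, for example, only includes them when IncludeScopes is enabled.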

This is one of the best ways to keep correlation consistent.

Logging important state transitions

State transitions are some of the most valuable logs in industrial systems.

csharp
public enum InspectionState
{
    Idle,
    Preparing,
    Running,
    Completing,
    Completed,
    Error,
    Aborted
}

public sealed class InspectionStateMachine
{
    private readonly ILogger<InspectionStateMachine> _logger;

    public InspectionStateMachine(ILogger<InspectionStateMachine> logger)
    {
        _logger = logger;
    }

    public InspectionState CurrentState { get; private set; } = InspectionState.Idle;

    public void TransitionTo(
        InspectionState newState,
        string runId,
        string reason)
    {
        var oldState = CurrentState;
        CurrentState = newState;

        _logger.LogInformation(
            "Inspection state changed. RunId={RunId}, OldState={OldState}, NewState={NewState}, Reason={Reason}",
            runId, oldState, newState, reason);
    }
}

This kind of log is gold during incident analysis.

When the customer says, “It got stuck,” this tells you where it got stuck.
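The TransitionTo method above records every change but accepts any transition. A minimal sketch of a guard that could sit in front of it follows. The enum is repeated only so the block compiles on its own, and the transition table is an assumption for illustration, not the real machine's rules.

```csharp
using System;
using System.Collections.Generic;

// Same members as the InspectionState enum above, repeated for self-containment.
public enum InspectionState
{
    Idle, Preparing, Running, Completing, Completed, Error, Aborted
}

public static class InspectionTransitions
{
    // Hypothetical table of legal moves; adjust to the real workflow.
    private static readonly Dictionary<InspectionState, InspectionState[]> Allowed = new()
    {
        [InspectionState.Idle] = new[] { InspectionState.Preparing },
        [InspectionState.Preparing] = new[] { InspectionState.Running, InspectionState.Error, InspectionState.Aborted },
        [InspectionState.Running] = new[] { InspectionState.Completing, InspectionState.Error, InspectionState.Aborted },
        [InspectionState.Completing] = new[] { InspectionState.Completed, InspectionState.Error },
        [InspectionState.Completed] = new[] { InspectionState.Idle },
        [InspectionState.Error] = new[] { InspectionState.Idle },
        [InspectionState.Aborted] = new[] { InspectionState.Idle },
    };

    // True when moving from 'from' to 'to' is listed in the table.
    public static bool IsAllowed(InspectionState from, InspectionState to) =>
        Allowed.TryGetValue(from, out var targets) && Array.IndexOf(targets, to) >= 0;
}
```

A rejected transition is worth a Warning log with the old state, the requested state, and the reason, because it usually points at a race or a stale event.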

Capturing errors with enough context

A very common mistake is logging an exception without the operation context.

Bad:

csharp
_logger.LogError(ex, "Save failed");

Better:

csharp
_logger.LogError(
    ex,
    "Failed to save inspection result. RunId={RunId}, WaferId={WaferId}, ResultPath={ResultPath}, Step={Step}",
    runId,
    waferId,
    resultPath,
    "PersistFinalResult");

Now the log tells you:

  • what failed
  • for which run
  • for which wafer
  • at which step
  • against which output path

That is the minimum needed to investigate.

Logging async and background operations correctly

Desktop systems often have background loops for polling, streaming, processing, and health monitoring.

These loops are dangerous because failures can become invisible.

csharp
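using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Logging;

public sealed class MachineStatusMonitor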
{
    private readonly ILogger<MachineStatusMonitor> _logger;
    private readonly IMachineGateway _machineGateway;

    public MachineStatusMonitor(
        ILogger<MachineStatusMonitor> logger,
        IMachineGateway machineGateway)
    {
        _logger = logger;
        _machineGateway = machineGateway;
    }

    public async Task RunAsync(string machineId, CancellationToken cancellationToken)
    {
        using var scope = _logger.BeginScope(new Dictionary<string, object>
        {
            ["MachineId"] = machineId
        });

        _logger.LogInformation("Machine status monitor started.");

        while (!cancellationToken.IsCancellationRequested)
        {
            try
            {
                var status = await _machineGateway.GetStatusAsync(machineId, cancellationToken);

                _logger.LogDebug(
                    "Machine status polled. Status={Status}, IsConnected={IsConnected}",
                    status.State,
                    status.IsConnected);

                await Task.Delay(TimeSpan.FromMilliseconds(500), cancellationToken);
            }
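            catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested)
            {
                // Only treat this as a normal stop when our own token was canceled.
                // A gateway-internal timeout falls through to the error handler below.
                _logger.LogInformation("Machine status monitor stopping due to cancellation.");
                break;
            }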
            catch (Exception ex)
            {
                _logger.LogError(ex, "Unhandled exception in machine status monitor loop.");

                // Optional backoff to avoid hot failure loops
                try
                {
                    await Task.Delay(TimeSpan.FromSeconds(2), cancellationToken);
                }
                catch (OperationCanceledException)
                {
                    break;
                }
            }
        }

        _logger.LogInformation("Machine status monitor stopped.");
    }
}

Important lessons here:

  • background loops should log start and stop
  • exceptions must be caught inside the loop
  • repeated failures should not create a CPU-burning retry storm
  • cancellation should be logged differently from errors

Timing important operations

Latency is often the hidden cause of workflow issues.

csharp
// Method fragment: assumes _logger and _machineGateway fields on the enclosing class.
public async Task SendCommandAsync(
    string machineId,
    string commandName,
    string runId,
    CancellationToken cancellationToken)
{
    var startedAt = DateTime.UtcNow;
    var sw = System.Diagnostics.Stopwatch.StartNew();

    try
    {
        _logger.LogInformation(
            "Sending machine command. RunId={RunId}, MachineId={MachineId}, Command={Command}",
            runId, machineId, commandName);

        await _machineGateway.SendAsync(machineId, commandName, cancellationToken);

        sw.Stop();

        _logger.LogInformation(
            "Machine command completed. RunId={RunId}, MachineId={MachineId}, Command={Command}, DurationMs={DurationMs}",
            runId, machineId, commandName, sw.ElapsedMilliseconds);
    }
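    catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested)
    {
        // Cancellation is an expected stop, so it is not logged as an error.
        _logger.LogWarning(
            "Machine command canceled. RunId={RunId}, MachineId={MachineId}, Command={Command}, DurationMs={DurationMs}",
            runId, machineId, commandName, sw.ElapsedMilliseconds);
        throw;
    }
    catch (Exception ex)
    {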
        sw.Stop();

        _logger.LogError(
            ex,
            "Machine command failed. RunId={RunId}, MachineId={MachineId}, Command={Command}, DurationMs={DurationMs}, StartedAtUtc={StartedAtUtc}",
            runId, machineId, commandName, sw.ElapsedMilliseconds, startedAt);

        throw;
    }
}

Timing logs are extremely useful for diagnosing slowdowns before they become outright failures.


PART 5 — COMMON MISTAKES (VERY REALISTIC)

Logging too little

This is the classic failure.

Symptoms:

  • only final error is logged
  • no state transitions
  • no operation ids
  • no step durations
  • no context about input or machine state

Production consequence:

You know something failed, but you cannot explain why. Engineers end up guessing, adding temporary logs, and waiting for the bug to happen again.

This is expensive and embarrassing in front of customers.

Logging too much

The opposite problem is also real.

Symptoms:

  • every method entry/exit logged
  • every UI binding event logged
  • every loop iteration at Info level
  • huge raw payload dumps for every message
  • thousands of repetitive logs during normal operation

Production consequence:

  • storage cost grows
  • important signals drown in noise
  • incident analysis becomes slower, not faster
  • log viewers become almost unusable
  • performance may degrade under load

A noisy system is not an observable system. It is just a loud system.

Missing context

This is deadly.

You may have hundreds of logs, but none say:

  • which run?
  • which machine?
  • which wafer?
  • which recipe?
  • which workflow step?

Production consequence:

You cannot reconstruct one incident from concurrent activity.

In industrial apps, many things may be happening at once. Without context, all failures blur together.

Logging only errors without flow

Many teams only log when something goes wrong.

That sounds reasonable, but it breaks debugging.

Why?

Because the error log tells you what failed, but not what led to it.

Example:

  • Error: Timeout while waiting for AutoFocus

Useful, but incomplete.

You also want to know:

  • was the recipe just switched?
  • had the machine recently reconnected?
  • did a slow warning happen 20 seconds earlier?
  • was the workflow already in recovery mode?
  • had cancellation already been requested?

Production consequence:

You see the crash, not the path to the crash.

Ignoring background task failures

This is one of the most dangerous desktop-system mistakes.

Examples:

  • fire-and-forget task throws and nobody observes it
  • polling loop dies silently
  • channel consumer exits and pipeline quietly stops
  • retry worker crashes and never restarts

From the operator’s view, the system becomes “weird”:

  • data stops updating
  • UI still looks alive
  • machine state no longer refreshes
  • workflows hang waiting for signals that no worker is processing

Production consequence:

Silent data loss, stale UI, stuck workflows, and terrifying nondeterministic behavior.

In real systems, background work must be supervised, and failures must be surfaced loudly.
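A minimal sketch of supervision for fire-and-forget work is shown below. `RunSupervised` is a hypothetical helper, not a framework API; the `onFault` callback stands in for whatever the application does to surface the failure (log at Error, raise an alarm, restart the worker).

```csharp
using System;
using System.Threading.Tasks;

public static class BackgroundTasks
{
    // Runs the work on the thread pool and guarantees that a fault is reported
    // instead of disappearing with an unobserved task.
    public static Task RunSupervised(
        string name,
        Func<Task> work,
        Action<string, Exception> onFault)
    {
        return Task.Run(async () =>
        {
            try
            {
                await work().ConfigureAwait(false);
            }
            catch (OperationCanceledException)
            {
                // Cancellation is a normal stop, not a fault.
            }
            catch (Exception ex)
            {
                onFault(name, ex);
            }
        });
    }
}
```

A call could look like `BackgroundTasks.RunSupervised("StatusPoller", () => monitor.RunAsync(machineId, token), (n, ex) => logger.LogError(ex, "Background worker {Worker} faulted.", n));`, where `monitor`, `token`, and `logger` are assumed to exist in the caller. The returned task can still be awaited at shutdown to confirm the worker has actually stopped.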


PART 6 — PERFORMANCE & TRADE-OFFS

Logging overhead

Logging is not free.

Costs include:

  • string formatting
  • allocation of objects/properties
  • serialization cost for structured fields
  • disk or network I/O
  • lock contention in some sinks
  • pressure on CPU and memory under heavy volume

In high-throughput systems, careless logging can become part of the performance problem.

Examples:

  • logging every frame in an image pipeline at Info level
  • logging every status poll at high frequency
  • writing logs synchronously to disk from hot paths
  • logging large payloads or raw image metadata constantly

Synchronous vs asynchronous logging

Synchronous logging is simpler, but riskier for performance-sensitive operations.

If a hot path waits for disk I/O or slow log sink flushing, you introduce latency into the production flow.

That is bad in machine-control paths.

Asynchronous logging reduces the direct impact on the caller, but introduces trade-offs:

  • logs may be delayed
  • buffered logs may be lost on crash if not flushed
  • queue overflow strategies matter
  • diagnosing shutdown issues gets harder

In practice, many production systems use async/buffered sinks for throughput, but are careful to flush on shutdown and keep critical failure paths reliable.
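To illustrate these trade-offs, here is a small sketch of a buffered writer built on System.Threading.Channels. `BufferedLogWriter` is a hypothetical class, and the `sink` delegate stands in for the real file or network writer. In a real application an established logging library's async sink is the more likely choice; the sketch only makes the overflow policy and the flush-on-shutdown step visible.

```csharp
using System;
using System.Threading.Channels;
using System.Threading.Tasks;

public sealed class BufferedLogWriter : IAsyncDisposable
{
    private readonly Channel<string> _queue;
    private readonly Task _pump;

    public BufferedLogWriter(Action<string> sink, int capacity = 1024)
    {
        _queue = Channel.CreateBounded<string>(new BoundedChannelOptions(capacity)
        {
            // Overflow policy: when full, discard the oldest line to keep the newest.
            FullMode = BoundedChannelFullMode.DropOldest,
            SingleReader = true
        });

        // Single background consumer that drains the queue into the sink.
        _pump = Task.Run(async () =>
        {
            await foreach (var line in _queue.Reader.ReadAllAsync())
            {
                sink(line);
            }
        });
    }

    // Never blocks the caller. Returns false only after the writer has been completed.
    public bool Enqueue(string line) => _queue.Writer.TryWrite(line);

    // Flush on shutdown: stop accepting lines, then wait until the queue is drained.
    public async ValueTask DisposeAsync()
    {
        _queue.Writer.TryComplete();
        await _pump.ConfigureAwait(false);
    }
}
```

The caller pays only for an in-memory enqueue, but lines still in the queue are lost if the process crashes before DisposeAsync runs. That is the delay-versus-durability trade-off described above.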

Balancing detail vs performance

This is senior-level judgment.

You do not want to log everything. You want to log the things that explain behavior.

Good candidates for always-on Info logs:

  • run started/completed/aborted
  • state transitions
  • machine connect/disconnect
  • command send/complete/fail
  • workflow step start/finish/fail
  • critical retries and recoveries

Good candidates for Debug logs:

  • detailed protocol chatter
  • queue depth changes
  • polling details
  • fine-grained timing
  • verbose SDK callback traces

The practical pattern is:

  • keep high-value lifecycle logs always on
  • keep high-volume diagnostics available but controlled by level/configuration
  • avoid expensive payload logging in hot loops unless temporarily enabled for incident analysis
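One way to keep expensive diagnostics available without paying for them all the time is to check the level before building the payload. The sketch below uses `ILogger.IsEnabled` from Microsoft.Extensions.Logging; the `FrameDiagnostics` helper and its parameters are hypothetical.

```csharp
using System;
using Microsoft.Extensions.Logging;

public static class FrameDiagnostics
{
    public static void LogFrameHeader(
        ILogger logger,
        string runId,
        int frameIndex,
        ReadOnlySpan<byte> header)
    {
        // Skip the hex conversion entirely unless Debug is enabled.
        if (!logger.IsEnabled(LogLevel.Debug))
        {
            return;
        }

        logger.LogDebug(
            "Frame header received. RunId={RunId}, FrameIndex={FrameIndex}, HeaderHex={HeaderHex}",
            runId, frameIndex, Convert.ToHexString(header));
    }
}
```

During normal operation the hot loop pays for one level check per frame. Raising the configured level to Debug for an incident turns the detail on without a code change.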

PART 7 — SENIOR ENGINEER THINKING

How experienced engineers design logging strategy

A senior engineer does not treat logging as an afterthought. They design it as part of system behavior.

They ask:

  • What failures will happen in the field?
  • What will support or engineers need to know?
  • Which workflows need end-to-end traceability?
  • Which identifiers must be present on every log?
  • Which background processes can fail silently?
  • What should be visible at Info vs Debug?
  • How will we diagnose timing issues?

That means logging is designed around real operational questions, not random LogInformation calls.

What to log vs what not to log

Log:

  • lifecycle events
  • state transitions
  • external commands and outcomes
  • retries, fallbacks, timeouts
  • workflow boundaries
  • background worker start/stop/failure
  • operation durations
  • degraded modes and partial failures

Usually do not log:

  • every trivial method call
  • repetitive noise with no diagnostic value
  • huge objects or payloads by default
  • sensitive data
  • high-frequency events at high severity

The question is always:

Will this help explain system behavior later?

If yes, it is probably worth logging. If not, it is probably noise.

How to make logs actionable for debugging

Actionable logs answer real engineering questions.

A useful log usually contains:

  • what happened
  • where it happened
  • which operation/run it belongs to
  • which entity was involved
  • what the system was trying to do
  • whether it succeeded, failed, retried, or degraded
  • how long it took
  • exception details if relevant

Bad log:

text
Error in workflow

Actionable log:

text
Failed to transition workflow step from AutoFocus to Capture after machine timeout.
RunId=RUN-20260321-001 MachineId=M-03 Recipe=ABC-2026 DurationMs=5102 RetryCount=2

That gives engineers something to work with.

How to design systems that are diagnosable under pressure

This is the real mark of maturity.

When production is on fire, nobody wants clever architecture that cannot explain itself.

Diagnosable systems have these traits:

  • explicit states instead of random booleans
  • clear workflow boundaries
  • correlated logs across components
  • supervised background tasks
  • meaningful error classification
  • timing visibility
  • recoveries and retries logged as first-class events
  • enough information to reconstruct a timeline

A senior engineer thinks beyond “Does it work?” They think: “When it fails at 2 AM on a customer machine, can we understand it fast?”

That mindset changes architecture.

You start building systems that expose their own behavior instead of hiding it.


Final takeaway

In industrial .NET desktop systems, observability is not just about logs. It is about making runtime truth visible.

A production-grade WPF machine-control system is full of concurrency, timing sensitivity, hardware uncertainty, and long-running workflows. When issues happen, the debugger is usually gone. The code is no longer enough. What matters is the evidence the system left behind.

Good logging gives you that evidence.

Not too little. Not too much. Just enough structured, correlated, high-value information to reconstruct what really happened.

That is how senior engineers design systems that can survive real production pressure.
