Vision System Failures & Diagnostics

Principal Software Architect View

This topic sits directly inside the roadmap area for Vision, Imaging & Inspection Systems, especially camera acquisition, triggered capture, image buffering, alignment, defect detection, storage/retrieval, overlays, and motion integration. It also strongly overlaps with diagnostics, replay, traceability, and long-running machine behavior because vision failures are rarely isolated software bugs.

PART 1 — Why Vision Failures Are Hard to Debug

Vision failures are hard to debug because the symptom usually appears at the end of the pipeline, while the root cause may lie much earlier.

An operator sees:

text

Missed defect
False defect
Bad measurement
Missing image
Delayed result
Wrong overlay

But the actual cause may be:

text

Lighting drift
Focus drift
Camera exposure change
Trigger timing error
Frame buffer overflow
Wrong recipe version
Wrong coordinate transform
Image assigned to wrong wafer position
Processing backlog
Storage queue blocking acquisition

The dangerous mistake is assuming:

text

Bad inspection result = bad algorithm

In production, many “algorithm problems” are actually system problems: optics, timing, buffering, or correlation.

Example:

text

Symptom:
    Defect detection suddenly becomes unstable.

First suspicion:
    The threshold algorithm is bad.

Actual root cause:
    The lens focus slowly drifted during the shift.
    The image was still valid as an image file,
    but it was no longer valid as inspection evidence.

Another example:

text

Symptom:
    Some wafer regions have wrong inspection results.

First suspicion:
    Camera occasionally fails.

Actual root cause:
    Processing latency increased.
    Result N was applied to wafer position N+1.

This is why strong industrial vision systems are not designed only to “produce results”. They are designed to explain results later.

PART 2 — Failure Layers in a Vision System

A practical vision stack looks like this:

text

+------------------------------------------------------+
| 12. Storage / Traceability                           |
|     Images, metadata, recipes, results, history      |
+------------------------------------------------------+
| 11. UI Visualization                                 |
|     Live image, overlays, result review              |
+------------------------------------------------------+
| 10. Workflow Integration                             |
|     Lot / wafer / step / region / retry context      |
+------------------------------------------------------+
|  9. Detection / Measurement / Decision Logic         |
|     Thresholds, rules, classification, pass/fail     |
+------------------------------------------------------+
|  8. Alignment / Registration                         |
|     Fiducials, transforms, coordinate mapping        |
+------------------------------------------------------+
|  7. Processing Pipeline                              |
|     Filtering, correction, feature extraction        |
+------------------------------------------------------+
|  6. Image Quality Validation                         |
|     Focus, exposure, contrast, blur, saturation      |
+------------------------------------------------------+
|  5. Buffering / Transfer                             |
|     SDK buffers, queues, frame grabber, memory       |
+------------------------------------------------------+
|  4. Triggering / Timing                              |
|     Hardware trigger, software trigger, timestamps   |
+------------------------------------------------------+
|  3. Camera Acquisition                               |
|     Exposure, gain, frame ID, camera state           |
+------------------------------------------------------+
|  2. Illumination / Optics                            |
|     Light, lens, focus, glare, contamination         |
+------------------------------------------------------+
|  1. Physical Scene / Part / Wafer                    |
|     Actual object, surface, position, condition      |
+------------------------------------------------------+

Each layer can fail differently.

| Layer | What Can Fail | Production Symptom | Useful Evidence |
|---|---|---|---|
| Physical scene | Wrong part, tilted wafer, contamination | Unexpected defects | Wafer ID, position, image sample |
| Illumination/optics | Light intensity drift, glare, dirty lens | False defects, unstable contrast | Light setting, exposure, saved image |
| Camera acquisition | Missed exposure, wrong gain, camera not ready | Missing/black image | Frame ID, camera status, timestamp |
| Triggering/timing | Trigger too early/late | Image at wrong position | Trigger ID, motion position |
| Buffering/transfer | Overflow, backlog, dropped frame | Missing/late image | Queue depth, dropped counters |
| IQ validation | Bad image accepted | Valid file, invalid inspection | Focus/contrast/saturation metrics |
| Processing pipeline | Wrong parameter set | Inconsistent output | Pipeline config, stage outputs |
| Alignment | Fiducial mismatch, wrong transform | Overlay shift, bad measurement | Transform, confidence score |
| Decision logic | Threshold too sensitive | False pass/fail | Rule version, decision reason |
| Workflow | Result assigned to wrong step | Wrong wafer/region result | Workflow step ID, correlation ID |
| UI | Shows latest image, not inspected image | Operator misled | Display frame ID vs result frame ID |
| Storage | Missing metadata or wrong recipe | Replay impossible | Evidence package completeness |

PART 3 — Acquisition & Frame Failures

Acquisition failures are dangerous because they often appear intermittently and only at production speed.

1. Dropped Frames

text

Trigger 101 -> Frame 101 received
Trigger 102 -> No frame
Trigger 103 -> Frame 103 received

Production symptom:

text

One region has no image.
Inspection result missing.
Machine retries occasionally.
Operator says: "It only happens sometimes."

Likely causes:

text

Camera bandwidth limit
Frame grabber buffer overflow
Processing queue too slow
Storage blocking acquisition
Trigger rate too high
SDK buffer pool exhausted

Evidence needed:

text

Frame ID
Trigger ID
Camera dropped-frame counter
SDK buffer overflow counter
Queue depth
Timestamp at trigger / receive / process
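The evidence list above can be turned into an automatic check. The following is a minimal sketch, assuming each received frame carries the ID of the trigger that produced it (not every camera SDK provides this). `FrameRecord` and `find_dropped_triggers` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FrameRecord:
    """Hypothetical metadata record for one received frame."""
    frame_id: int
    trigger_id: int
    receive_time_ms: float


def find_dropped_triggers(fired_trigger_ids, frames):
    """Return trigger IDs that were fired but never produced a frame."""
    received = {f.trigger_id for f in frames}
    return [t for t in fired_trigger_ids if t not in received]


# Mirrors the example above: trigger 102 fired, but no frame arrived.
fired = [101, 102, 103]
frames = [
    FrameRecord(frame_id=101, trigger_id=101, receive_time_ms=5.0),
    FrameRecord(frame_id=103, trigger_id=103, receive_time_ms=45.0),
]
dropped = find_dropped_triggers(fired, frames)
print(dropped)
```

Running a reconciliation like this per wafer or per scan turns "it only happens sometimes" into a concrete list of trigger IDs that can be lined up against queue depth and the dropped-frame counters.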

2. Duplicated Frames

text

Trigger 201 -> Frame A
Trigger 202 -> Frame A again

Production symptom:

text

Two wafer positions appear to have identical image content.
Inspection result looks plausible but is spatially wrong.

Likely causes:

text

SDK reused last frame after timeout
Software reused previous buffer
UI displayed latest cached image
Frame ID not checked
Metadata copied incorrectly

Evidence needed:

text

Frame ID
Image checksum/hash
Trigger ID
Buffer ID
Capture timestamp
Workflow step ID

A robust system should detect:

text

Same frame ID used for different trigger IDs
Same image hash used for different physical positions
Result frame ID does not match expected trigger ID
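The first two checks could be sketched as follows. This is an illustration under assumptions: the record layout (trigger ID, frame ID, position ID, raw image bytes) and the name `find_duplicates` are hypothetical, and SHA-256 is just one reasonable choice of image hash.

```python
import hashlib
from collections import defaultdict


def find_duplicates(records):
    """records: iterable of (trigger_id, frame_id, position_id, image_bytes).

    Returns (reused_frames, repeated_images):
      - frame IDs that were delivered for more than one trigger ID
      - image hashes that appear at more than one physical position
    """
    triggers_by_frame = defaultdict(set)
    positions_by_hash = defaultdict(set)
    for trigger_id, frame_id, position_id, image_bytes in records:
        triggers_by_frame[frame_id].add(trigger_id)
        digest = hashlib.sha256(image_bytes).hexdigest()
        positions_by_hash[digest].add(position_id)

    reused_frames = sorted(f for f, t in triggers_by_frame.items() if len(t) > 1)
    repeated_images = sorted(h for h, p in positions_by_hash.items() if len(p) > 1)
    return reused_frames, repeated_images


# Mirrors the example above: triggers 201 and 202 both received "Frame A".
records = [
    (201, 9001, "P1", b"frame-A-pixels"),
    (202, 9001, "P2", b"frame-A-pixels"),
    (203, 9003, "P3", b"frame-B-pixels"),
]
reused, repeated = find_duplicates(records)
print(reused, len(repeated))
```

Real sensor noise makes two separate exposures with byte-identical content very unlikely, which is why an exact hash match across different positions is a strong hint that a buffer was reused.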

3. Corrupted Frames

Production symptom:

text

Image has broken lines, partial image, wrong dimensions, random noise, or black bands.

Likely causes:

text

Transfer interruption
DMA/buffer issue
Cable/interference
SDK memory ownership violation
Frame read before complete

Evidence needed:

text

Image dimensions
Pixel format
Frame completion flag
CRC/checksum if available
Camera status
Transfer error code
Raw saved image

4. Late Frames

text

Trigger at T0
Expected frame by T0 + 20 ms
Actual frame arrives at T0 + 180 ms

Production symptom:

text

Machine seems correct at low speed.
At full throughput, result arrives too late.
Downstream decision uses stale information.

Likely causes:

text

Processing backlog
Thread scheduling delay
GC pause
Storage queue blocking
Unbounded queue growth
Low-priority acquisition thread

Evidence needed:

text

Capture timestamp
Receive timestamp
Processing start/end timestamp
Result publication timestamp
Queue depth over time
GC/memory counters
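With those timestamps recorded, the delay can be attributed to a stage instead of being guessed. A minimal sketch, assuming all timestamps come from one clock and are in milliseconds; the field names and the sample values are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageTimestamps:
    """Hypothetical per-inspection timestamps, all in ms on the same clock."""
    trigger_ms: float
    receive_ms: float
    process_start_ms: float
    process_end_ms: float
    publish_ms: float


def latency_breakdown(ts):
    """Split end-to-end latency into the time spent in each stage."""
    return {
        "transfer": ts.receive_ms - ts.trigger_ms,
        "queue_wait": ts.process_start_ms - ts.receive_ms,
        "processing": ts.process_end_ms - ts.process_start_ms,
        "publish": ts.publish_ms - ts.process_end_ms,
        "total": ts.publish_ms - ts.trigger_ms,
    }


def slowest_stage(ts):
    """Name of the stage that contributed the most latency."""
    stages = latency_breakdown(ts)
    stages.pop("total")
    return max(stages, key=stages.get)


def is_late(ts, budget_ms):
    """True when the result was published after the latency budget."""
    return (ts.publish_ms - ts.trigger_ms) > budget_ms


# Illustrative values: the frame arrived quickly but waited in a queue.
ts = StageTimestamps(trigger_ms=0.0, receive_ms=8.0, process_start_ms=150.0,
                     process_end_ms=170.0, publish_ms=180.0)
print(latency_breakdown(ts), slowest_stage(ts), is_late(ts, budget_ms=20.0))
```

In this example the processing itself took 20 ms; the other 142 ms were spent waiting in a queue, which points at backlog rather than at a slow algorithm.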

5. Trigger Accepted but No Frame Produced

Production symptom:

text

Motion controller says trigger fired.
Camera log says nothing arrived.
Inspection step times out.

Likely causes:

text

Camera not armed
Trigger polarity mismatch
Exposure time longer than trigger interval
Hardware line issue
Frame grabber missed trigger
Wrong acquisition mode

Evidence needed:

text

Camera armed state
Trigger counter
Hardware IO event log
Camera exposure state
Frame grabber trigger count
Machine sequence state

PART 4 — Image Quality Failures

Image quality failures are subtle because the image may be technically present and readable.

text

The file exists.
The image opens.
The pipeline runs.
The result is wrong.

That is the dangerous case.

Common Image Quality Failures

Blur / Focus Drift

Symptoms:

text

Edges become soft.
Measurement repeatability decreases.
Defect confidence drops.
False negatives increase.

Why it misleads:

text

The algorithm still runs.
No exception occurs.
The image looks "almost okay" to a human.

Evidence:

text

Focus metric
Edge sharpness score
Autofocus position
Z position
Saved image samples over time
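One common way to compute a focus metric is the variance of the Laplacian. The sketch below is not necessarily the metric any particular system uses; it is written in pure Python on a nested list so it runs anywhere, whereas a production system would normally use an optimized image library.

```python
def laplacian_variance(image):
    """Variance of the 4-neighbour Laplacian over the image interior.

    image: 2D list of grayscale values (equal-length rows, at least 3x3).
    Sharp edges give large Laplacian responses, so higher means sharper.
    """
    h, w = len(image), len(image[0])
    responses = []
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            lap = (image[y - 1][x] + image[y + 1][x]
                   + image[y][x - 1] + image[y][x + 1]
                   - 4 * image[y][x])
            responses.append(lap)
    mean = sum(responses) / len(responses)
    return sum((r - mean) ** 2 for r in responses) / len(responses)


# A hard step edge versus the same edge smeared into a linear ramp.
sharp = [[0, 0, 0, 255, 255, 255] for _ in range(5)]
blurred = [[0, 51, 102, 153, 204, 255] for _ in range(5)]
print(laplacian_variance(sharp), laplacian_variance(blurred))
```

The absolute value depends heavily on scene content, so the useful signal is the trend against a known-good baseline for the same product and recipe, not a universal threshold.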

Underexposure / Overexposure

Symptoms:

text

Dark image
Washed-out image
Lost surface detail
Unstable thresholding

Evidence:

text

Mean intensity
Histogram
Exposure time
Gain
Light intensity setting
Saturation percentage

Saturation

Saturation is especially dangerous because information is permanently lost.

text

Pixel value = max
Surface detail = gone
Algorithm confidence = fake

Evidence:

text

Percentage of saturated pixels
Region-specific saturation metric
Lighting setting
Exposure/gain setting
Raw image
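The first two evidence items are cheap to compute. A minimal sketch, assuming 8-bit grayscale images held as nested lists; the function names and the region convention (half-open pixel ranges) are illustrative choices.

```python
def saturation_percent(image, max_value=255):
    """Percentage of pixels at or above the sensor's maximum value."""
    total = sum(len(row) for row in image)
    saturated = sum(1 for row in image for p in row if p >= max_value)
    return 100.0 * saturated / total


def region_saturation(image, x0, y0, x1, y1, max_value=255):
    """Saturation percentage inside the region [x0, x1) x [y0, y1)."""
    roi = [row[x0:x1] for row in image[y0:y1]]
    return saturation_percent(roi, max_value)


# The top-right quadrant is clipped; the rest of the image is fine.
image = [
    [10, 20, 255, 255],
    [10, 20, 255, 255],
    [10, 20, 30, 40],
    [10, 20, 30, 40],
]
print(saturation_percent(image))             # whole image
print(region_saturation(image, 2, 0, 4, 2))  # top-right quadrant only
```

The region-specific value matters because a whole-image figure of 25% hides the fact that one region of interest is completely clipped.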

Low Contrast

Symptoms:

text

Defect boundary becomes weak.
Alignment fiducial becomes unstable.
Measurement varies between runs.

Evidence:

text

Contrast metric
Histogram spread
Signal-to-noise ratio
Saved good/bad comparison images

Lighting Non-Uniformity

Symptoms:

text

Left side of image fails more often than right side.
Defect rate depends on position in image.
Overlay looks correct but threshold behaves differently across field.

Evidence:

text

Flat-field correction version
Illumination map
Per-region intensity statistics
Camera/lens calibration data

Reflection / Glare

Symptoms:

text

Bright spots interpreted as defects.
Defects hidden by reflection.
Failure depends on wafer surface, angle, or material.

Evidence:

text

Raw image
Lighting angle/settings
Product type
Wafer orientation
Recipe illumination mode

Contamination on Optics

Symptoms:

text

Same artifact appears in many images.
False defect appears in fixed camera coordinates, not wafer coordinates.

Experienced diagnosis:

text

If the artifact stays in image coordinates,
suspect optics/camera.

If the artifact moves with wafer coordinates,
suspect product/wafer.
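This rule of thumb can be automated when each detection is stored with both its image coordinates and its wafer coordinates. The following is a rough sketch: the tolerances, the repeat count, and the grid-bucketing approach are assumptions chosen for brevity, and bucketing can split a cluster that straddles a cell boundary.

```python
from collections import Counter


def _bucket(xy, cell_size):
    """Snap a coordinate pair to a coarse grid cell."""
    return (round(xy[0] / cell_size), round(xy[1] / cell_size))


def classify_artifact(detections, image_tol_px=3.0, wafer_tol_um=50.0,
                      min_repeats=3):
    """detections: dicts with 'image_xy' (pixels) and 'wafer_xy' (microns),
    one per detection, collected across several frames."""
    image_cells = Counter(_bucket(d["image_xy"], image_tol_px) for d in detections)
    wafer_cells = Counter(_bucket(d["wafer_xy"], wafer_tol_um) for d in detections)
    fixed_in_image = max(image_cells.values(), default=0) >= min_repeats
    fixed_on_wafer = max(wafer_cells.values(), default=0) >= min_repeats
    if fixed_in_image and not fixed_on_wafer:
        return "suspect optics/camera"
    if fixed_on_wafer and not fixed_in_image:
        return "suspect product/wafer"
    return "inconclusive"


# Same pixel location in every frame, different places on the wafer.
dust_on_lens = [
    {"image_xy": (512, 300), "wafer_xy": (1000, 0)},
    {"image_xy": (513, 301), "wafer_xy": (5000, 0)},
    {"image_xy": (512, 299), "wafer_xy": (9000, 0)},
]
# Same wafer location seen at different pixel locations.
real_defect = [
    {"image_xy": (100, 100), "wafer_xy": (2500, 1200)},
    {"image_xy": (400, 250), "wafer_xy": (2501, 1199)},
    {"image_xy": (700, 50), "wafer_xy": (2499, 1201)},
]
print(classify_artifact(dust_on_lens), classify_artifact(real_defect))
```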

PART 5 — Synchronization & Correlation Failures

Synchronization failures are among the most painful because every individual component may look correct.

The camera works. The motion system works. The algorithm works. The database works.

But the association is wrong.

Example Timeline

text

Time ─────────────────────────────────────────────────────>

Motion:
    Move to P1 -------- Move to P2 -------- Move to P3 ---->

Trigger:
              T1                 T2                 T3

Camera:
              F1                 F2                 F3

Processing:
                   Process F1         Process F2         Process F3

Result:
                        R1                 R2                 R3

Correct Mapping:
    P1 -> T1 -> F1 -> R1
    P2 -> T2 -> F2 -> R2
    P3 -> T3 -> F3 -> R3

Now imagine processing slows down:

text

Time ─────────────────────────────────────────────────────>

Motion:
    Move to P1 -------- Move to P2 -------- Move to P3 ---->

Trigger:
              T1                 T2                 T3

Camera:
              F1                 F2                 F3

Processing:
                        Process F1
                              Process F2
                                    Process F3

Bad Result Application:
    Current workflow position = P2
    Result received = R1

Wrong Mapping:
    R1 applied to P2

The image was real. The result was real. The failure was correlation.

Required Correlation Metadata

Every inspection result should carry:

text

Lot ID
Wafer ID / Part ID
Region ID / Die ID / Position ID
Workflow step ID
Trigger ID
Frame ID
Camera ID
Capture timestamp
Machine position at capture
Recipe ID/version
Processing pipeline version
Result ID

Without this, debugging becomes guesswork.

PART 6 — Processing & Inspection Instability

Processing failures often come from unstable assumptions.

Common Causes

text

Preprocessing parameter no longer fits current product
Lighting variation exceeds threshold margin
Alignment confidence is low but ignored
Measurement uses wrong calibrated scale
Algorithm output is converted into pass/fail without a recorded reason
Recipe change affects only one product variant

Why Replay Matters

A strong system allows engineers to ask:

text

Given the same image,
same recipe,
same calibration,
same algorithm version,
do we get the same result?

If yes:

text

The issue is likely in acquisition, image quality, recipe, or physical process.

If no:

text

The issue may be nondeterministic processing,
thread safety,
uninitialized state,
unpinned (floating) configuration,
or version mismatch.
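The yes/no question above can be scripted. This is a minimal sketch, assuming the pipeline is callable as `pipeline(image, recipe, calibration)` and returns a JSON-serializable dict; those conventions, the three outcome labels, and `toy_pipeline` are all hypothetical.

```python
import hashlib
import json


def result_fingerprint(result):
    """Stable hash of a result dict (keys sorted so ordering cannot differ)."""
    payload = json.dumps(result, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def check_replay(pipeline, image, recipe, calibration, production_result, runs=3):
    """Re-run the pipeline several times on identical inputs and classify."""
    fingerprints = {
        result_fingerprint(pipeline(image, recipe, calibration))
        for _ in range(runs)
    }
    if len(fingerprints) > 1:
        return "nondeterministic"  # same inputs, different outputs
    if fingerprints == {result_fingerprint(production_result)}:
        return "reproduced"        # processing is repeatable
    return "mismatch"              # stable offline, but not what production saw


def toy_pipeline(image, recipe, calibration):
    """Stand-in inspection: count pixels above a recipe threshold."""
    count = sum(1 for p in image if p > recipe["threshold"])
    return {"defect_pixels": count, "pass": count <= recipe["max_defect_pixels"]}


image = [10, 200, 220, 30]
recipe = {"threshold": 128, "max_defect_pixels": 1}
production_result = {"defect_pixels": 2, "pass": False}
print(check_replay(toy_pipeline, image, recipe, None, production_result))
```

Exact hashing suits integer and boolean outputs; results containing floating-point measurements would likely need a tolerance-based comparison instead.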

Intermediate Diagnostics

Do not save only final pass/fail.

For difficult inspections, capture:

text

Raw image
Corrected image
Region of interest
Alignment overlay
Detected features
Measurement values
Threshold values
Confidence scores
Decision reason

Bad diagnostic design:

text

Result = FAIL

Good diagnostic design:

text

Result = FAIL
Reason = WidthOutOfTolerance
MeasuredWidth = 12.84 um
Limit = 12.50 um
FrameId = 882193
TriggerId = 771002
RecipeVersion = RCP-17.4
AlignmentConfidence = 0.91
FocusScore = 0.76
ProcessingTimeMs = 18
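In code, the difference is whether the decision function returns a bare boolean or a record. A minimal sketch, assuming the limit is an upper bound on width; the type and function names are illustrative, and only a subset of the fields above is shown.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class WidthDecision:
    """Pass/fail together with the evidence needed to explain it later."""
    passed: bool
    reason: str
    measured_width_um: float
    limit_um: float
    frame_id: int
    trigger_id: int
    recipe_version: str


def decide_width(measured_width_um, limit_um, frame_id, trigger_id, recipe_version):
    """Compare a width against its upper limit and record why."""
    passed = measured_width_um <= limit_um
    reason = "WithinTolerance" if passed else "WidthOutOfTolerance"
    return WidthDecision(passed, reason, measured_width_um, limit_um,
                         frame_id, trigger_id, recipe_version)


# The values from the example above.
decision = decide_width(12.84, 12.50, frame_id=882193, trigger_id=771002,
                        recipe_version="RCP-17.4")
print(asdict(decision))
```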

PART 7 — Diagnostic Evidence to Capture

A production-grade vision system should capture an evidence package.

text

InspectionEvidencePackage
{
    Identity:
        LotId
        WaferId / PartId
        RegionId
        WorkflowStepId
        InspectionId

    Acquisition:
        CameraId
        FrameId
        TriggerId
        CaptureTimestamp
        ReceiveTimestamp
        Exposure
        Gain
        PixelFormat
        CameraStatus

    Motion / Timing:
        MachinePosition
        AxisPositions
        MotionState
        TriggerTimestamp
        PositionAtTrigger

    Image:
        RawImageReference
        SampleImageReference
        ImageHash
        Width
        Height

    ImageQuality:
        FocusScore
        ContrastScore
        SaturationPercent
        NoiseMetric
        BrightnessMean

    Recipe / Config:
        RecipeId
        RecipeVersion
        AlgorithmVersion
        CalibrationVersion
        LightSettings

    Pipeline:
        StageTimings
        IntermediateArtifacts
        ProcessingWarnings

    Alignment:
        Transform
        FiducialPositions
        AlignmentConfidence

    Decision:
        Measurements
        Thresholds
        Classification
        PassFail
        DecisionReason

    Runtime:
        QueueDepth
        DroppedFrameCount
        BufferOverflowCount
        ErrorContext
}

The most important rule:

text

Capture evidence before retry, reset, or recovery.

The reason: a retry often destroys the original context.

Example:

text

Failure occurs.
System retries.
Second attempt succeeds.
Operator sees success.
Original failed image is lost.
Root cause becomes invisible.

A strong system records:

text

First attempt failed because image quality score was below threshold.
Retry succeeded after autofocus correction.

That difference matters.

PART 8 — Replay, Offline Analysis & Reproducibility

Replay means:

text

Stored image + stored metadata + stored recipe/config
        ↓
Run inspection offline
        ↓
Compare output with production result

Replay Flow

text

+-------------------------+
| Production Inspection   |
| Raw image + metadata    |
+-----------+-------------+
            |
            v
+-------------------------+
| Evidence Store          |
| Image, recipe version,  |
| calibration, settings   |
+-----------+-------------+
            |
            v
+-------------------------+
| Offline Replay Tool     |
| Re-run same pipeline    |
+-----------+-------------+
            |
            v
+-------------------------+
| Comparison              |
| Old result vs new result|
| Machine A vs Machine B  |
| Recipe X vs Recipe Y    |
+-------------------------+

Replay helps compare:

text

Good run vs failing run
Old algorithm vs new algorithm
Machine A vs Machine B
Before recipe change vs after recipe change
Production result vs offline result

But replay has limits.

Replay may not reproduce:

text

Dropped frames
Trigger timing problems
Camera readiness problems
Buffer overflow
Motion/image correlation mistakes
Thread scheduling issues
Hardware noise

Replay is strongest for:

text

Image quality analysis
Processing instability
Algorithm regression
Recipe validation
Alignment/debug review
Decision explanation

Replay only works if the original context was stored correctly.

Bad replay package:

text

image_001.png

Good replay package:

text

image_001.png
metadata.json
recipe_R17.4.json
calibration_C8.2.json
pipeline_version.txt
result_original.json
stage_timings.json

PART 9 — Real-World Failure Scenarios

Scenario 1 — Image Quality Slowly Degrades

Production symptom:

text

False defects increase during the shift.
No single hard failure.
Operators start overriding results.

Why it misleads:

text

The camera still works.
The algorithm still works.
The images still look acceptable at first glance.

Experienced diagnosis:

text

Plot focus score, brightness, contrast, and false defect rate over time.
Compare early-shift images with late-shift images.
Check lens contamination, lighting temperature, focus drift, vibration.

Strong design:

text

Image quality metrics are stored per inspection.
Sample images are retained.
Trend charts exist for service engineers.

Scenario 2 — Missing Frames Only at Full Throughput

Production symptom:

text

Lab testing passes.
Slow mode passes.
Production speed fails intermittently.

Why it misleads:

text

Engineers test one frame at a time.
The issue only appears when acquisition, processing, UI, and storage run together.

Experienced diagnosis:

text

Check queue depth under full load.
Check dropped-frame counters.
Check processing latency distribution.
Check whether storage blocks acquisition.
Check GC and memory pressure.

Strong design:

text

Bounded queues
Backpressure strategy
Per-stage timing
Dropped-frame counters
Acquisition independent from UI/storage

Scenario 3 — Overlay Looks Correct but Stored Result Uses Different Coordinate Frame

Production symptom:

text

Operator review screen looks correct.
Database result location is wrong.
Downstream analysis says defect is in another region.

Why it misleads:

text

The UI may apply one transform.
The result exporter may apply another.

Experienced diagnosis:

text

Compare image coordinates, machine coordinates, wafer coordinates, and display coordinates.
Check transform version.
Check calibration ID.
Check whether overlay and result use the same registration output.

Strong design:

text

Coordinate frame is explicit in every result.
Transform version is stored.
Overlay rendering uses the same evidence package as result storage.

Scenario 4 — Inspection Fails Only on One Machine

Production symptom:

text

Same recipe works on Machine A.
Fails on Machine B.

Why it misleads:

text

Software version may be identical.
But optics, camera calibration, lighting, mechanical alignment, or firmware may differ.

Experienced diagnosis:

text

Compare camera calibration.
Compare illumination intensity.
Compare focus position.
Compare image quality metrics.
Replay Machine B images on Machine A software.
Compare raw images, not only results.

Strong design:

text

Machine identity
Calibration version
Camera serial number
Light controller settings
Firmware versions
All stored with inspection evidence

Scenario 5 — Replay Gives Different Result Because Recipe Version Was Not Stored

Production symptom:

text

Engineer replays failed image.
Offline result does not match production result.

Why it misleads:

text

The image is correct.
But the current recipe is not the recipe used during production.

Experienced diagnosis:

text

Find exact recipe version used at inspection time.
Check parameter audit trail.
Check whether recipe was edited online.

Strong design:

text

Immutable recipe snapshot per production run.
Recipe version stored with every result.
Replay uses historical recipe, not latest recipe.

Scenario 6 — Result Delayed and Applied to Wrong Wafer Region

Production symptom:

text

Defect map appears shifted.
Failures occur only during high load.

Why it misleads:

text

Every individual result looks valid.
The error is in the result-to-region mapping.

Experienced diagnosis:

text

Trace InspectionId from trigger to frame to result to storage.
Compare result timestamp with current workflow step.
Verify the workflow does not use "current position" when result arrives.

Strong design:

text

Result carries original region ID.
Workflow never infers context from current state.
Correlation ID is immutable.

Scenario 7 — UI Hides Frame Drops by Showing Latest Image Only

Production symptom:

text

Operator sees live images.
System reports missing inspection data.
No one believes frames are missing.

Why it misleads:

text

The UI is not showing the inspected frame.
It is showing the latest available frame.

Experienced diagnosis:

text

Display frame ID on UI.
Compare displayed frame ID with inspection result frame ID.
Check whether UI subscribes to acquisition stream or result stream.

Strong design:

text

UI clearly distinguishes:
- live camera image
- image being processed
- image used for result
- reviewed stored image

Scenario 8 — Storage Queue Blocks and Causes Acquisition Backlog

Production symptom:

text

Machine slows down.
Frames occasionally drop.
Disk usage or network storage latency spikes.

Why it misleads:

text

The failure appears in acquisition.
The real root cause is storage backpressure.

Experienced diagnosis:

text

Check storage queue depth.
Check disk/network latency.
Check whether image saving happens on acquisition path.
Check whether retention policy changed.

Strong design:

text

Acquisition path is isolated.
Storage is asynchronous and bounded.
If storage cannot keep up, the system raises a clear alarm before acquisition collapses.

PART 10 — Software Design Implications

A vision system must be designed for diagnosis from day one.

Bad Design

text

Camera captures image
Algorithm returns pass/fail
UI shows green/red
Database stores final result

Missing:

text

No frame ID
No trigger ID
No recipe version
No image quality metrics
No alignment confidence
No stage timings
No saved image
No decision reason
No replay package

This system may work in a demo, but it will fail under production pressure.

Good Design

text

Camera / Trigger / Motion
        |
        v
+----------------------+
| Frame + Metadata     |
| FrameId              |
| TriggerId            |
| Timestamp            |
| Position             |
+----------+-----------+
           |
           v
+------------------------------+
| Vision Pipeline              |
| IQ metrics                   |
| Stage timings                |
| Alignment confidence         |
| Intermediate diagnostics     |
+----------+-------------------+
           |
           v
+------------------------------+
| Inspection Result            |
| Measurements                 |
| Decision reason              |
| Recipe/config version        |
| Coordinate frame             |
+----------+-------------------+
           |
           v
+------------------------------+
| Evidence Package             |
| Image refs + metadata        |
| Replay context               |
| Error/fault context          |
+----------+-------------------+
           |
           v
+------------------------------+
| Review / Replay / RCA        |
| Operator + service engineer  |
| Offline engineering analysis |
+------------------------------+

Architectural Principles

1. Never Process an Image Without Context

Bad:

text

Process(image)

Better:

text

Process(InspectionFrame frame)

Where InspectionFrame includes:

text

Image
InspectionId
FrameId
TriggerId
CaptureTime
CameraId
MachinePosition
RegionId
WorkflowStepId
RecipeVersion

2. Never Apply Result to “Current” Workflow State

Bad:

text

currentRegion.Apply(result)

Better:

text

result.RegionId = frame.RegionId
result.InspectionId = frame.InspectionId

The result must carry its own identity.
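A minimal sketch of this principle follows. The types are hypothetical and reduced to the identity fields; `inspect` stands in for the real pipeline. The point is that the defect map is keyed by the region recorded in the result, so late or out-of-order results still land in the right place.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InspectionFrame:
    inspection_id: str
    region_id: str
    frame_id: int
    image: bytes


@dataclass(frozen=True)
class InspectionResult:
    inspection_id: str
    region_id: str
    frame_id: int
    passed: bool


def inspect(frame, passed):
    """Stand-in for the real pipeline; identity is copied from the frame."""
    return InspectionResult(frame.inspection_id, frame.region_id,
                            frame.frame_id, passed)


class DefectMap:
    def __init__(self):
        self.by_region = {}

    def apply(self, result):
        # Keyed by the result's own region, never by the machine's
        # current position at the moment the result happens to arrive.
        self.by_region[result.region_id] = result


f1 = InspectionFrame("INS-1", "P1", 1, b"...")
f2 = InspectionFrame("INS-2", "P2", 2, b"...")

defect_map = DefectMap()
# Results arrive late and out of order; the machine is already past P2.
defect_map.apply(inspect(f2, passed=True))
defect_map.apply(inspect(f1, passed=False))
print({region: r.passed for region, r in defect_map.by_region.items()})
```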

3. Separate Acquisition, Processing, UI, and Storage

Bad:

text

Camera callback
    -> process image
    -> update UI
    -> save image
    -> write database

Good:

text

Acquisition stream
    -> bounded processing queue
    -> result stream
        -> UI subscriber
        -> storage subscriber
        -> diagnostics subscriber

This prevents UI or storage from blocking acquisition.
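One way to get that isolation is to give every subscriber its own bounded inbox and never block the publisher. The sketch below uses the standard-library `queue.Queue`; the class names are hypothetical, and drop-and-count is only one possible overflow policy. A system that must retain every image would raise an alarm or slow the machine instead of dropping.

```python
import queue


class BoundedSubscriber:
    """A consumer with its own bounded inbox and an overflow counter."""

    def __init__(self, name, max_pending):
        self.name = name
        self.inbox = queue.Queue(maxsize=max_pending)
        self.overflow_count = 0

    def offer(self, item):
        """Non-blocking hand-off; a full inbox is counted, never waited on."""
        try:
            self.inbox.put_nowait(item)
            return True
        except queue.Full:
            self.overflow_count += 1
            return False


class ResultStream:
    """Fans each result out to all subscribers without blocking."""

    def __init__(self):
        self.subscribers = []

    def subscribe(self, subscriber):
        self.subscribers.append(subscriber)

    def publish(self, result):
        for subscriber in self.subscribers:
            subscriber.offer(result)


stream = ResultStream()
ui = BoundedSubscriber("ui", max_pending=8)
storage = BoundedSubscriber("storage", max_pending=2)  # simulated stalled disk
stream.subscribe(ui)
stream.subscribe(storage)

# Nothing drains the inboxes here, which simulates slow consumers.
for frame_id in range(5):
    stream.publish({"frame_id": frame_id})

print(ui.inbox.qsize(), storage.inbox.qsize(), storage.overflow_count)
```

The stalled storage subscriber overflows, but the publisher and the UI subscriber are unaffected, and the overflow count is available as a diagnostic counter.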

4. Capture Diagnostic Counters Continuously

Important counters:

text

Frames acquired
Frames dropped
Frames duplicated
Trigger count
Trigger/frame mismatch count
Queue depth
Processing latency
Storage latency
Buffer overflow count
Image quality failure count
Alignment failure count

Counters are often more useful than logs during production escalation.

5. Preserve Evidence Before Recovery

Recovery logic must not erase the crime scene.

Bad:

text

Timeout -> reset camera -> retry -> success

Good:

text

Timeout
    -> capture diagnostic snapshot
    -> record camera/queue/trigger state
    -> save related metadata
    -> then reset/retry
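The ordering in the good sequence could be sketched like this. The callables (`acquire`, `reset`, `snapshot_state`) and `FakeCamera` are hypothetical stand-ins for a real camera wrapper; the only point being shown is that the snapshot is taken before the reset.

```python
import time


def acquire_with_evidence(acquire, reset, snapshot_state, evidence_log,
                          max_attempts=2):
    """Try to acquire a frame; record state before every recovery action."""
    for attempt in range(1, max_attempts + 1):
        try:
            frame = acquire()
            evidence_log.append({"attempt": attempt, "outcome": "success"})
            return frame
        except TimeoutError as exc:
            # Snapshot first: reset() may clear counters and buffers.
            evidence_log.append({
                "attempt": attempt,
                "outcome": "timeout",
                "error": str(exc),
                "state": snapshot_state(),
                "recorded_at": time.time(),
            })
            if attempt < max_attempts:
                reset()
    raise TimeoutError(f"acquisition failed after {max_attempts} attempts")


class FakeCamera:
    """Test double: times out until it has been reset once."""

    def __init__(self):
        self.state = {"armed": False, "queue_depth": 7}

    def acquire(self):
        if not self.state["armed"]:
            raise TimeoutError("no frame within timeout")
        return b"frame-bytes"

    def reset(self):
        self.state = {"armed": True, "queue_depth": 0}

    def snapshot(self):
        return dict(self.state)  # copy, so a later reset cannot alter it


camera = FakeCamera()
evidence = []
frame = acquire_with_evidence(camera.acquire, camera.reset, camera.snapshot,
                              evidence)
print([entry["outcome"] for entry in evidence], evidence[0]["state"])
```

The retry succeeds, yet the log still shows that the first attempt timed out with the camera unarmed and seven items queued, which is the context a reset would otherwise have erased.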

PART 11 — Interview / Real-World Talking Points

A strong explanation in an interview could sound like this:

In industrial vision systems, I would not treat a wrong inspection result as only an algorithm problem. The result is the end of a long chain: lighting, optics, camera acquisition, trigger timing, buffering, image quality, processing, alignment, workflow context, UI, and storage. A production-grade design needs correlation IDs, frame IDs, trigger IDs, timestamps, recipe versions, image quality metrics, alignment confidence, and stage timings so that we can reconstruct what happened later.

Another strong version:

The key diagnostic principle is that every inspection result must be explainable. I want to know which image was used, where the machine was, which trigger produced it, which recipe version was active, how good the image quality was, how alignment performed, which rule made the decision, and whether the result can be replayed offline.

Common mistakes software engineers make when entering this domain:

text

They trust pass/fail without evidence.
They treat images as files instead of correlated production events.
They ignore timing and motion context.
They let UI or storage block acquisition.
They do not store recipe/config versions.
They assume replay is possible without metadata.
They log errors but not enough state to diagnose root cause.
They use current workflow state instead of immutable inspection context.

What strong engineers understand:

text

A vision failure is often cross-layer.
The visible symptom is rarely the root cause.
Timing and correlation are as important as image processing.
A valid image can still produce an invalid decision.
Replay requires image + metadata + recipe + calibration.
Diagnostics must be designed before production failures happen.
Evidence must be captured before retry/reset destroys context.

The mindset shift is:

text

Do not only ask:
    "Why did the algorithm fail?"

Ask:
    "Can I reconstruct exactly what the machine saw,
     when it saw it,
     where it was,
     what configuration it used,
     how the pipeline transformed it,
     and why the final decision was made?"

That is the difference between a vision demo and a production industrial vision system.
