Computer Vision
Computer vision is a field of artificial intelligence (AI) that enables computers to interpret and understand the visual world, with the aim of approaching the way humans do. Computer vision systems use machine learning algorithms to analyze images and videos and extract information from them. This information supports tasks such as object detection, image classification, and facial recognition.
Process
[1] Input Image/Video
↓
[2] Preprocessing
- Resize, Normalize
- Format conversion
↓
[3] Object Detection Model (e.g., YOLO, SSD, Faster R-CNN)
↓
[4] Raw Detections (Bounding Boxes + Class IDs + Scores)
↓
[5] Post-Processing
- Non-Maximum Suppression (NMS)
- Threshold filtering
↓
[6] Text Data Generation
- Class ID → Human-readable label (e.g., "car", "person")
- Metadata formatting (JSON/XML/CSV)
↓
[7] Output Integration
- Display (on image/video)
- Export to file/database/API
- Logging or further analytics
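Steps [5] and [6] of the flow above can be sketched in plain Python. This is a minimal illustration, not any particular detector's implementation: the `LABELS` map, the detection dictionaries, and the threshold values are all assumptions made for the example, and the NMS shown is the simple greedy per-class variant.

```python
import json

# Hypothetical class-ID-to-label map; a real one comes from the model's training dataset.
LABELS = {0: "person", 1: "car"}

def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def postprocess(detections, score_thresh=0.5, iou_thresh=0.5):
    """Threshold filtering followed by greedy per-class NMS."""
    # Drop low-confidence detections, then visit the rest from highest score down.
    candidates = sorted(
        (d for d in detections if d["score"] >= score_thresh),
        key=lambda d: d["score"],
        reverse=True,
    )
    kept = []
    for det in candidates:
        # Suppress a box if a higher-scoring box of the same class overlaps it too much.
        suppressed = any(
            k["class_id"] == det["class_id"] and iou(k["box"], det["box"]) > iou_thresh
            for k in kept
        )
        if not suppressed:
            kept.append(det)
    return kept

def to_json(detections):
    """Map class IDs to human-readable labels and serialize the result as JSON."""
    records = [
        {"label": LABELS.get(d["class_id"], "unknown"), "score": d["score"], "box": d["box"]}
        for d in detections
    ]
    return json.dumps(records)

raw = [
    {"box": [10, 10, 110, 110], "class_id": 1, "score": 0.90},
    {"box": [12, 12, 112, 112], "class_id": 1, "score": 0.80},    # near-duplicate of the first
    {"box": [200, 50, 260, 200], "class_id": 0, "score": 0.75},
    {"box": [300, 300, 320, 320], "class_id": 0, "score": 0.20},  # below the score threshold
]
final = postprocess(raw)
print(to_json(final))
```

Here the second car box is suppressed as a duplicate and the low-score person box is filtered out, leaving one "car" and one "person" record.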
Advanced Processing
- Tracking (for video) → Assign IDs over time + label
- Captioning → Generate sentences like “A person is riding a bicycle.”
- Scene Graphs → Understand relationships: person → riding → bike
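To make "assign IDs over time" concrete, here is a minimal sketch of one simple approach: greedy matching of each new box to the best-overlapping box from the previous frame. The `IouTracker` class and its threshold are hypothetical names chosen for this example. Production trackers usually add motion models and keep unmatched tracks alive for several frames, which this sketch does not.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

class IouTracker:
    """Greedy tracker: a detection inherits the ID of the best-overlapping previous box."""

    def __init__(self, iou_thresh=0.3):
        self.iou_thresh = iou_thresh
        self.tracks = {}   # track ID -> box seen in the previous frame
        self.next_id = 1

    def update(self, boxes):
        """Assign a track ID to every box in the current frame and return {id: box}."""
        assigned = {}
        unmatched = dict(self.tracks)
        for box in boxes:
            best_id, best_iou = None, self.iou_thresh
            for track_id, prev_box in unmatched.items():
                overlap = iou(prev_box, box)
                if overlap >= best_iou:
                    best_id, best_iou = track_id, overlap
            if best_id is None:
                # No previous box overlaps enough: start a new track.
                best_id = self.next_id
                self.next_id += 1
            else:
                # Each previous track can be claimed by only one detection.
                del unmatched[best_id]
            assigned[best_id] = box
        self.tracks = assigned  # tracks without a match are dropped immediately
        return assigned

tracker = IouTracker()
frame1 = tracker.update([(0, 0, 10, 10), (50, 50, 60, 60)])
frame2 = tracker.update([(52, 50, 62, 60), (1, 0, 11, 10)])  # same objects, slightly moved
frame3 = tracker.update([(100, 100, 110, 110)])              # a new object appears
print(frame1, frame2, frame3)
```

The two objects keep their IDs between the first and second frame even though the detections arrive in a different order, and the object in the third frame receives a fresh ID.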
Object detection
Object detection is a computer vision task that involves identifying and locating objects in images or videos. Object detection models are trained on large datasets of images that have been annotated with bounding boxes around the objects of interest. Once trained, these models can be used to detect objects in new images or videos.
A typical pipeline moves through four stages:
- Image acquisition: capturing images or video frames.
- Preprocessing: cleaning and preparing images for analysis, for example by resizing them and normalizing pixel values.
- Feature extraction: computing features from the image that describe its content.
- Classification: assigning the image, or in object detection each candidate region, to a category.
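The preprocessing stage can be illustrated with a small sketch in plain Python. It assumes a grayscale image stored as a list of rows of 8-bit values; the function names and the `mean`/`std` values of 0.5 are assumptions for the example, and real pipelines normally rely on an image library and the statistics expected by the chosen model.

```python
def resize_nearest(image, new_h, new_w):
    """Nearest-neighbour resize of a 2-D grayscale image stored as a list of rows."""
    old_h, old_w = len(image), len(image[0])
    return [
        [image[r * old_h // new_h][c * old_w // new_w] for c in range(new_w)]
        for r in range(new_h)
    ]

def normalize(image, mean=0.5, std=0.5):
    """Scale 8-bit pixel values to [0, 1], then standardize with the given mean and std."""
    return [[(px / 255.0 - mean) / std for px in row] for row in image]

image = [
    [0, 0, 255, 255],
    [0, 0, 255, 255],
    [255, 255, 0, 0],
    [255, 255, 0, 0],
]
small = resize_nearest(image, 2, 2)
prepared = normalize(small)
print(small)     # [[0, 255], [255, 0]]
print(prepared)  # [[-1.0, 1.0], [1.0, -1.0]]
```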
Labelling Annotation
Annotation is the process of manually labelling objects in images or videos. It is time-consuming but essential for training object detection models. Several annotation tools are available, such as LabelImg and VGG Image Annotator.
- LabelImg: a graphical image annotation tool for drawing bounding boxes around objects in images.
- VGG Image Annotator: a web-based image annotation tool for creating bounding boxes, polygons, and other annotations.
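Bounding-box annotations are commonly stored either as pixel corner coordinates (as in Pascal VOC XML files) or as normalized centre and size values (as in YOLO text files). The sketch below converts one to the other. The function names are hypothetical, and the six-decimal formatting is simply a choice made for this example.

```python
def voc_to_yolo(box, image_width, image_height):
    """Convert (xmin, ymin, xmax, ymax) in pixels to
    (x_center, y_center, width, height), each normalized to [0, 1]."""
    xmin, ymin, xmax, ymax = box
    x_center = (xmin + xmax) / 2 / image_width
    y_center = (ymin + ymax) / 2 / image_height
    width = (xmax - xmin) / image_width
    height = (ymax - ymin) / image_height
    return x_center, y_center, width, height

def yolo_line(class_id, box, image_width, image_height):
    """Format one annotation as a 'class x_center y_center width height' text line."""
    values = voc_to_yolo(box, image_width, image_height)
    return f"{class_id} " + " ".join(f"{v:.6f}" for v in values)

line = yolo_line(0, (100, 200, 300, 400), 640, 480)
print(line)  # 0 0.312500 0.625000 0.312500 0.416667
```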
COCO Metrics
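COCO metrics are a set of evaluation metrics used to measure the performance of object detection models. They include average precision (AP), average recall (AR), and mean average precision (mAP). A detection counts as a true positive when its intersection over union (IoU) with a ground-truth box of the same class meets a chosen threshold.
- Average precision (AP): the area under the precision-recall curve for one class at a given IoU threshold. COCO's primary AP averages this over IoU thresholds from 0.50 to 0.95 in steps of 0.05.
- Average recall (AR): recall is the number of true positives divided by the total number of ground truth objects. COCO's AR is the maximum recall given a fixed number of detections per image (1, 10, or 100), averaged over classes and IoU thresholds.
- Mean average precision (mAP): the average AP over all object classes. COCO reports this class-averaged value simply as AP.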
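As a worked illustration of AP and mAP, the sketch below computes the area under the precision-recall curve for a single class at a single IoU threshold, then averages over classes. It assumes each detection has already been marked as a true or false positive, and it uses all-point interpolation; the official COCO evaluator instead samples precision at 101 recall levels and averages over several IoU thresholds, so its numbers will differ. The function name and the toy data are made up for the example.

```python
def average_precision(scores, is_true_positive, num_ground_truth):
    """Area under the precision-recall curve for one class (all-point interpolation)."""
    # Visit detections from the highest confidence score to the lowest.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    precisions, recalls = [], []
    for i in order:
        if is_true_positive[i]:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_ground_truth)
    # Replace each precision with the best precision at any higher recall (the envelope).
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Sum precision times the recall gained at each step.
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap

# Toy results: 4 ground-truth cars, 2 ground-truth persons.
ap_car = average_precision([0.9, 0.8, 0.7, 0.6], [True, False, True, True], 4)
ap_person = average_precision([0.95, 0.5], [True, True], 2)
mean_ap = (ap_car + ap_person) / 2
print(ap_car, ap_person, mean_ap)  # 0.625 1.0 0.8125
```

For the car class, one of the four ground-truth objects is never detected, so recall tops out at 0.75 and AP cannot reach 1.0 regardless of how the remaining detections are ranked.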
Model Section
Data quality: The quality of the data used to train computer vision models is critical. Data should be clean, accurate, and representative of the real world.
Model selection: The choice of model is important. Different models have different strengths and weaknesses, so it is important to choose a model that is appropriate for the task at hand.
Hyperparameter tuning: The hyperparameters of a model can have a significant impact on its performance. It is important to tune the hyperparameters carefully to get the best results.
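Commonly used object detection models include:
- R-CNN: a two-stage object detection model that first proposes regions of interest (RoIs) and then classifies each one.
- Fast R-CNN: an improved version of R-CNN that runs the convolutional network once per image and pools features for each RoI, making it faster and more accurate.
- Faster R-CNN: a further improvement on Fast R-CNN that replaces the separate region proposal step with a learned Region Proposal Network, making it faster again.
- SSD: a single-stage object detection model that predicts boxes and classes in one pass, making it faster than R-CNN.
- YOLO: another single-stage object detection model designed for real-time speed. Whether it is faster than SSD depends on the versions, backbones, and input sizes being compared.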
