Evaluate Models
Inferium’s model evaluation framework ensures that users have access to high-quality and reliable AI models. This process involves adaptive metrics, human evaluation, and a comprehensive scoring system to provide an in-depth analysis of each model's performance. Here's how Inferium evaluates models:
Adaptive Metrics
Model-Specific Metrics: Inferium uses adaptive metrics that are specifically tailored to each type of AI model. For instance, classification models are evaluated using metrics like precision, recall, and F1-score, while regression models are assessed using mean squared error (MSE) and mean absolute error (MAE).
Batch Daily Inference: By gathering data and running daily inference tasks, Inferium continuously evaluates models against these metrics. This ongoing assessment ensures that models keep performing well and adapt to changes in data or usage patterns.
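As a rough illustration of how adaptive, model-specific metrics can be selected, the sketch below dispatches on a task-type label and computes the metrics named above with scikit-learn. The evaluate() helper and the task labels are illustrative assumptions, not Inferium's actual API.

```python
# Minimal sketch of adaptive, model-specific metrics (illustrative only;
# the task labels and the evaluate() helper are assumptions, not Inferium's API).
from sklearn.metrics import (precision_score, recall_score, f1_score,
                             mean_absolute_error, mean_squared_error)

def evaluate(task_type, y_true, y_pred):
    """Pick metrics based on the model's task type."""
    if task_type == "classification":
        return {
            "precision": precision_score(y_true, y_pred, average="macro"),
            "recall": recall_score(y_true, y_pred, average="macro"),
            "f1": f1_score(y_true, y_pred, average="macro"),
        }
    if task_type == "regression":
        return {
            "mae": mean_absolute_error(y_true, y_pred),
            "mse": mean_squared_error(y_true, y_pred),
        }
    raise ValueError(f"Unsupported task type: {task_type}")

# A daily batch job would call evaluate() on fresh inference results.
print(evaluate("classification", [0, 1, 1, 0], [0, 1, 0, 0]))
print(evaluate("regression", [2.0, 3.5, 5.0], [2.1, 3.0, 4.8]))
```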
Human Evaluation
Review, Feedback and Rank Models: Users can deploy and test models, provide feedback, and rate models based on their experience. By running multiple models at the same time, users can rank their relative performance, which produces a score for each model. These scores are used to calculate each model's Inferenced Score.
Tournament and Crowdsourced Judging: Inferium organizes tournaments where developers compete with each other by showcasing their models. This not only fosters innovation but also provides valuable performance data for Inferium ML.
Similar to CodeHawk, Inferium also involves expert judges who meet specific eligibility criteria to participate in evaluation. Depending on the tournament, judges score models against various metrics.
For example: LLM models are judged on Relevance, Fluency, Coherence, Consistency, Semantics, Simplicity, and Grammaticality.
Judge Eligibility
Participated in previous tournaments and won a prize
Owns models with more than 50K usages and an average rating above 4
Strong educational background and experience (LinkedIn and patent cross-check)
Passes Inferium's code challenges
Scoring System
In an ideal scenario, the Author Score and Machine Score should be closely aligned. Significant discrepancies indicate either overfitting by the author’s model or potential inaccuracies in the benchmarking system, prompting feedback and necessary adjustments.
All models, regardless of domain, can be grouped into the problem types below.
Scoring Criteria
1. Machine Learning
1.1 Classification
Confusion Matrix
Accuracy
Precision
Recall
F1-Score
Accuracy is often the first metric evaluated because it's straightforward and easy to interpret, especially in balanced datasets.
F1-Score balances precision and recall, which makes it especially useful for imbalanced datasets.
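As a minimal sketch, the classification metrics listed above can be computed with scikit-learn as follows (the toy labels are purely illustrative):

```python
# Sketch of the classification metrics above, computed with scikit-learn.
from sklearn.metrics import (confusion_matrix, accuracy_score,
                             precision_score, recall_score, f1_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # ground-truth labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # model predictions

print(confusion_matrix(y_true, y_pred))          # [[TN, FP], [FN, TP]]
print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("f1       :", f1_score(y_true, y_pred))
```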
1.2 Regression: Predicting a continuous value based on input features.
Mean Absolute Error (MAE): Measures the average magnitude of errors between predicted and actual values.
Mean Squared Error (MSE): Measures the average squared difference between predicted and actual values.
Root Mean Squared Error (RMSE): The square root of MSE, providing error in the same units as the output variable.
R-Squared (R²): Represents the proportion of variance in the dependent variable that is predictable from the independent variables.
Mean Squared Error (MSE) is typically used as the headline regression metric because squaring penalizes large errors more heavily.
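A minimal sketch of the regression metrics above, using scikit-learn and NumPy on toy values:

```python
# Sketch of MAE, MSE, RMSE, and R-squared on toy predictions.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)                      # same units as the target variable
r2 = r2_score(y_true, y_pred)

print(f"MAE={mae:.3f} MSE={mse:.3f} RMSE={rmse:.3f} R2={r2:.3f}")
```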
1.3 Clustering: Grouping similar data points together without predefined labels.
Silhouette Score: Measures how similar an object is to its own cluster compared to other clusters.
Davies-Bouldin Index: Evaluates cluster validity by comparing the average similarity ratio of each cluster with its most similar cluster.
Adjusted Rand Index (ARI): Measures the similarity between ground truth and clustering results, adjusted for chance.
Silhouette Score: Useful for evaluating the quality of clustering by examining both cohesion (how close the points are within the same cluster) and separation (how far apart the clusters are).
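The clustering metrics above can be sketched as follows; k-means on synthetic blobs is used purely as an example clustering, and ARI additionally requires ground-truth labels:

```python
# Sketch of silhouette score, Davies-Bouldin index, and ARI with scikit-learn.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, davies_bouldin_score, adjusted_rand_score

X, y_true = make_blobs(n_samples=300, centers=3, random_state=42)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

print("silhouette    :", silhouette_score(X, labels))         # cohesion vs. separation
print("davies-bouldin:", davies_bouldin_score(X, labels))      # lower is better
print("ARI           :", adjusted_rand_score(y_true, labels))  # needs ground truth
```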
2. Computer vision domain
2.1. Image Classification: Labeling the entire image.
Confusion Matrix
Accuracy
Precision
Recall
F1-Score
Accuracy or F1-Score is typically used as the primary metric.
2.2. Object Detection and Recognition: Identifying and localizing objects within an image.
Bounding box evaluation:
IoU (Intersection over Union): Measures the overlap between the predicted bounding box and the ground truth box.
Class evaluation:
Mean Average Precision (mAP): The mean of the average precision values across all classes.
Precision-Recall Curve: A plot showing the trade-off between precision and recall at different threshold levels.
F1-Score: Particularly useful when precision and recall are of equal importance.
Mean Average Precision (mAP): A metric used to evaluate most popular object detection models such as Fast R-CNN, YOLO, Mask R-CNN, etc. mAP encapsulates the trade-off between precision and recall and maximizes the effect of both metrics. It is the current benchmark metric used by the computer vision research community to evaluate the robustness of object detection models.
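As a minimal sketch, IoU for axis-aligned bounding boxes (the building block behind mAP) can be computed as follows, assuming boxes in (x1, y1, x2, y2) format:

```python
# IoU between two axis-aligned boxes given as (x1, y1, x2, y2).
def iou(box_a, box_b):
    # Intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((10, 10, 50, 50), (30, 30, 70, 70)))  # partial overlap -> ~0.14
```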
2.3. Image Segmentation: Partitioning an image into regions corresponding to different objects or parts.
Pixel Accuracy/Mean Pixel Accuracy (MPA): The ratio of correctly predicted pixels to the total number of pixels.
Mean Intersection over Union (Mean IoU): The average IoU across all classes.
Dice Coefficient (F1-Score): A measure of overlap between the predicted segmentation and the ground truth.
Boundary F1-Score (BFScore): Evaluates the accuracy of the segmentation boundaries.
Mean IoU provides a balanced view by considering true positives, false positives, and false negatives, unlike metrics such as Pixel Accuracy. It evaluates the model's performance for every class, making it suitable for datasets with imbalanced class distributions. Mean IoU has become the industry-standard benchmark for segmentation models in many competitions and research papers, making it easier to compare performance across different models and datasets.
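A minimal sketch of per-class IoU, Mean IoU, and the Dice coefficient on small label masks (the toy masks are illustrative):

```python
# Mean IoU and Dice coefficient for segmentation label masks.
import numpy as np

def mean_iou(y_true, y_pred, num_classes):
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(y_true == c, y_pred == c).sum()
        union = np.logical_or(y_true == c, y_pred == c).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))

def dice(y_true, y_pred, cls=1):
    inter = np.logical_and(y_true == cls, y_pred == cls).sum()
    return 2 * inter / ((y_true == cls).sum() + (y_pred == cls).sum())

gt = np.array([[0, 0, 1, 1], [0, 1, 1, 1]])
pred = np.array([[0, 1, 1, 1], [0, 0, 1, 1]])
print("Mean IoU      :", mean_iou(gt, pred, num_classes=2))
print("Dice (class 1):", dice(gt, pred))
```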
2.4. Optical Character Recognition (OCR): Detecting and recognizing text in images.
Character Error Rate (CER): The number of incorrect characters divided by the total number of characters in the ground truth.
Word Error Rate (WER): The number of word-level errors (substitutions, deletions, insertions) divided by the total number of words in the ground truth.
Levenshtein Distance: Measures the minimum number of single-character edits required to change one word into another.
CER/WER provide clear, actionable insights into the performance of OCR systems. They are the essential metrics used in the evaluation of text recognition.
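As a minimal sketch, CER and WER can both be derived from a plain Levenshtein edit distance, applied to character sequences and to word lists respectively:

```python
# CER/WER from a Levenshtein edit distance (no external libraries).
def levenshtein(ref, hyp):
    # Classic dynamic-programming edit distance over two sequences.
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (r != h))
    return dp[-1]

def cer(reference, hypothesis):
    return levenshtein(reference, hypothesis) / len(reference)

def wer(reference, hypothesis):
    ref_words, hyp_words = reference.split(), hypothesis.split()
    return levenshtein(ref_words, hyp_words) / len(ref_words)

print("CER:", cer("optical character", "optical charactor"))
print("WER:", wer("detect text in images", "detect test in image"))
```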
2.5. Facial Recognition: Identifying individuals based on facial features.
True Positive Rate (TPR) - Recall: The proportion of true positives among all actual positives.
False Positive Rate (FPR): The proportion of false positives among all actual negatives.
Receiver Operating Characteristic (ROC) Curve: A plot showing the trade-off between TPR and FPR at different thresholds.
Area Under the Curve (AUC): The area under the ROC curve, summarizing the overall performance.
High TPR (Recall) indicates that the system is good at correctly recognizing faces it has seen before or verifying identities.
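A minimal sketch of TPR/FPR, the ROC curve, and AUC for verification-style similarity scores, using scikit-learn (the labels and scores are toy values):

```python
# ROC curve and AUC for face-verification style similarity scores.
from sklearn.metrics import roc_curve, roc_auc_score

y_true = [1, 1, 0, 1, 0, 0, 1, 0]                      # 1 = same identity, 0 = different
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.1]    # model similarity scores

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print("AUC:", roc_auc_score(y_true, y_score))
for f, t, th in zip(fpr, tpr, thresholds):
    print(f"threshold={th:.2f}  FPR={f:.2f}  TPR={t:.2f}")
```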
2.6. Pose Estimation: Detecting the orientation and position of objects or people.
Keypoint Accuracy: The accuracy of detecting keypoints (e.g., joints) in the image.
Mean Per Joint Position Error (MPJPE): The average distance between predicted and ground truth keypoints.
Percentage of Correct Keypoints (PCK): The percentage of keypoints that are within a certain distance from the ground truth.
Normalized Mean Error (NME): Normalized distance between predicted and ground truth keypoints.
PCK provides a clear, quantifiable measure of keypoint prediction accuracy that is easy to understand and widely applicable, making it a valuable metric for evaluating pose estimation performance.
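As a minimal sketch, PCK counts a keypoint as correct when its distance to the ground truth falls within a fraction (alpha) of a reference length, such as the subject's bounding-box size; the keypoints below are toy values:

```python
# PCK: fraction of predicted keypoints within alpha * reference_length of ground truth.
import numpy as np

def pck(pred, gt, reference_length, alpha=0.2):
    # pred, gt: arrays of shape (num_keypoints, 2)
    distances = np.linalg.norm(pred - gt, axis=1)
    return float(np.mean(distances <= alpha * reference_length))

gt = np.array([[50, 60], [80, 90], [120, 150]], dtype=float)
pred = np.array([[52, 58], [105, 90], [121, 149]], dtype=float)
print("PCK@0.2:", pck(pred, gt, reference_length=100))  # 2 of 3 keypoints -> ~0.67
```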
2.7. Image Super-Resolution: Enhancing image resolution.
Peak Signal-to-Noise Ratio (PSNR): Measures the ratio between the maximum possible power of a signal and the power of noise that affects the quality of the image.
Structural Similarity Index (SSIM): Measures the similarity between the super-resolved image and the ground truth in terms of structure, luminance, and contrast.
Root Mean Squared Error (RMSE): Measures the differences between values predicted by the model and the actual values.
Visual Information Fidelity (VIF): Evaluates the perceived quality of super-resolved images.
PSNR and SSIM are the most popular metrics for evaluating image super-resolution, with PSNR focusing on pixel-wise accuracy and SSIM on structural similarity.
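A minimal sketch of PSNR and SSIM with scikit-image, using a synthetic ground-truth image and a noisy "restored" version in place of a real super-resolution output:

```python
# PSNR and SSIM between a reference image and a degraded reconstruction.
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

rng = np.random.default_rng(0)
ground_truth = rng.random((64, 64))                         # stand-in for a high-res image
restored = np.clip(ground_truth + rng.normal(0, 0.05, (64, 64)), 0, 1)

print("PSNR:", peak_signal_noise_ratio(ground_truth, restored, data_range=1.0))
print("SSIM:", structural_similarity(ground_truth, restored, data_range=1.0))
```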
2.8. Image Generation: Creating new images using generative models.
Inception Score (IS): Measures the quality and diversity of generated images.
Fréchet Inception Distance (FID): Evaluates the distance between the distributions of real and generated images.
Perceptual Similarity: Measures the perceptual similarity between generated and reference images.
Mean Opinion Score (MOS): A human evaluation metric for the perceived quality of generated images.
Structural Similarity Index (SSIM)/ Multiscale Structural Similarity (MS-SSIM): Measures the structural similarity between generated images and reference images, focusing on luminance, contrast, and structure.
IS: Measures the quality and diversity of generated images by evaluating the confidence of a pre-trained Inception model on those images. Because it reuses a pre-trained network, it is relatively simple and computationally efficient to compute, and it is widely used for evaluating image generation models.
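As a sketch, the Inception Score reduces to the exponential of the average KL divergence between each image's class distribution p(y|x) and the marginal p(y); the probabilities below are toy values, whereas in practice they come from a pre-trained Inception network:

```python
# Inception Score from per-image class probabilities p(y|x).
import numpy as np

def inception_score(probs, eps=1e-12):
    # probs: array of shape (num_images, num_classes); rows sum to 1
    marginal = probs.mean(axis=0)                              # p(y)
    kl = probs * (np.log(probs + eps) - np.log(marginal + eps))
    return float(np.exp(kl.sum(axis=1).mean()))                # exp(E_x[KL(p(y|x) || p(y))])

confident = np.eye(5)                 # confident and diverse one-hot predictions
uniform = np.full((5, 5), 0.2)        # uninformative predictions
print("confident/diverse IS:", inception_score(confident))    # ~5 (number of classes)
print("uniform IS          :", inception_score(uniform))      # ~1
```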
2.9. Image Captioning: Generating descriptive text for images.
BLEU Score: Measures the precision of n-grams in the generated caption compared to a reference caption.
ROUGE Score: Evaluates the overlap of n-grams or subsequences between the generated and reference captions.
METEOR Score: Considers precision, recall, and synonyms for evaluating generated captions.
CIDEr: Measures the consensus in image captioning by comparing the generated caption to multiple reference captions.
3. Text - NLP domain
3.1 Group 1: Based on Ground-Truth Labels
a. Sentiment analysis: Determining the sentiment or emotion expressed in a text.
b. Text Classification: Assigning a category or label to a given text.
c. Named Entity Recognition (NER): Extracting entities from a piece of text into predefined categories such as personal names, organizations, locations, and quantities.
d. Part-of-Speech (POS) Tagging: Assigning a grammatical category (e.g., noun, verb, adjective) to each word in a text.
These tasks are scored with Accuracy or F1-Score.
3.2 Group 2: Text to Text (Seq2Seq) - Image2Text
a. Machine Translation: Automates translation between different languages
b. Text Generation (Text-to-Text Generation, Image-to-Text Generation)
Output is text
Autocomplete: Predicts what word comes next
Chatbots
Image captioning (see 2.9)
…
c. Text Summarization
d. Text-based Question Answering (QA)
BLEU score: Compares n-grams (sequences of words) in a generated text (e.g., machine-translated text or a generated caption) to those in one or more reference texts.
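A minimal sketch of a sentence-level BLEU score with NLTK (corpus-level BLEU is normally reported in practice; smoothing avoids zero scores on short sentences):

```python
# Sentence-level BLEU with NLTK on a toy translation pair.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["the", "cat", "sits", "on", "the", "mat"]]   # one or more reference texts
candidate = ["the", "cat", "is", "on", "the", "mat"]       # generated text

score = sentence_bleu(reference, candidate,
                      smoothing_function=SmoothingFunction().method1)
print("BLEU:", score)
```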
3.3 Others
a. Language Modeling (Topic modeling)
An unsupervised text mining task that takes a corpus of documents and discovers abstract topics within that corpus.
Perplexity: A metric commonly used in natural language processing to evaluate the quality of language models, particularly in the context of text generation. Perplexity quantifies how well a language model can predict the next word in a sequence and is calculated from the probability distribution of words generated by the model (see the sketch after this list).
b. Visual Question Answering (VQA): Answering open-ended questions based on an image.
VQAScore: measures the accuracy and relevance of answers generated by a VQA model in the context of the questions and images provided. It evaluates how closely the model's answers align with human-provided answers.
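As referenced above, a minimal sketch of perplexity computed from the probabilities a language model assigns to each observed token (the probabilities below are toy values):

```python
# Perplexity = exp(average negative log-likelihood of the observed tokens).
import math

def perplexity(token_probs):
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

print(perplexity([0.25, 0.10, 0.50, 0.05]))  # low probabilities -> high perplexity
print(perplexity([0.90, 0.80, 0.95, 0.85]))  # confident model -> perplexity near 1
```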
Download/Clone: 1 point
Deploy on Space: 1 point
Good Open Feedback: 1 point
Avg. Rating > 3: 3 points
Voting 1st/2nd/3rd Best on Comparison: 3 to 1 points
Crowdsourced Judging: 1st place in a Tournament: 100 points
Crowdsourced Judging: Top 5 in a Tournament: 50 points
All these metrics will be normalized to have a common scale from 0 to 100.
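As a sketch of one common way to map heterogeneous metrics onto a 0-100 scale, min-max normalization can be applied against the worst and best observed values; the exact normalization Inferium applies is not specified here, and the numbers below are illustrative:

```python
# Min-max normalization of a raw metric onto a common 0-100 scale.
def normalize_to_100(value, worst, best):
    if best == worst:
        return 0.0
    return 100 * (value - worst) / (best - worst)

# RMSE, where lower is better (worst observed 5.0, best observed 0.5)
print(normalize_to_100(1.2, worst=5.0, best=0.5))   # ~84.4
# Accuracy, where higher is better
print(normalize_to_100(0.91, worst=0.0, best=1.0))  # 91.0
```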