NLI Cross Encoders: 6 Ways to Use Them

Community Article Published September 23, 2025

NLI, or natural language inference, is a Swiss Army knife of the NLP world. NLI models frame natural language tasks as premise-hypothesis pairs, and each pair is classified as contradiction, neutral, or entailment. In practice, many natural language tasks can be reframed to fit this paradigm. If you've ever encountered these models in the wild and wondered what you could do with them, this article is for you!

Before we get into it, have a look at these new lightweight cross encoders, distilled from dleemiller/ModernCE-large-nli. Available in sizes from xxs to s, they are fast enough for CPU inference.

What is NLI?

Unlike STS (semantic textual similarity), NLI is often seen as less ambiguous. A single text can have multiple meanings, and similarity is bidirectional, with a many-to-many relationship between texts. NLI reduces this to a unidirectional relationship, which provides a simpler, less ambiguous, and more interpretable structure for many natural language tasks.

Consider a simple example with two texts: "The cat is sleeping" and "The cat is not sleeping".

These two statements have high semantic similarity, but they have opposite meanings. In the NLI framework, this pair would be classified as a contradiction. This helps capture logical relationships with a clear directionality, such as cause and effect or opposing meanings.

Let's make this clear with some concrete examples.

Loading a Model

For this article, we'll load the model using the code below, and then use it to demonstrate six easy-to-code applications. If you want to follow along, start with this:

from sentence_transformers import CrossEncoder

# Load the EttinX NLI model
model = CrossEncoder('dleemiller/EttinX-nli-s')

# Inference example texts
premise = "The cat is sleeping"
hypothesis = "The cat is not sleeping"

scores = model.predict([(premise, hypothesis)])

# Assign label
labels = ['contradiction', 'entailment', 'neutral']
predicted_label = labels[scores.argmax()]

The predicted_label will be contradiction.
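
The scores returned by predict here are raw class logits (which is why the later examples in this article apply a softmax). If you want probabilities, or want to score several pairs in one call, something like this minimal sketch works; the second example pair is just for illustration:

import torch

# Several premise/hypothesis pairs scored in one batch
pairs = [
    ("The cat is sleeping", "The cat is not sleeping"),
    ("The cat is sleeping", "An animal is resting"),
]

logits = model.predict(pairs, convert_to_tensor=True)
probs = torch.softmax(logits, dim=-1)  # shape: (num_pairs, 3)

for (premise, hypothesis), p in zip(pairs, probs):
    label = labels[int(p.argmax())]  # same label order as above
    print(f"{premise} / {hypothesis}: {label} ({float(p.max()):.3f})")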

1. Zero-Shot Classification and Tagging

Now, let's investigate some real applications where we could use this.

Do you really need billions of parameters to perform some simple tagging? Maybe in some cases, but why not give cross encoders a try?

Suppose we are trying to sort texts to pull out ones that are specifically customer complaints. Here's how we might do it:

texts = [
    "I'm extremely disappointed with my recent purchase. The product arrived damaged.",
    "Thank you for the excellent service! The delivery was fast and exceeded expectations.",
    "Can you please provide information about your return policy?",
    "This is absolutely unacceptable! I've been waiting 3 weeks for my order.",
    "I'd like to schedule a demo of your enterprise software solution."
]

# Zero-shot hypothesis prompt for "customer complaint"
hypothesis = "This text expresses a customer complaint."

for i, text in enumerate(texts, 1):
    scores = model.predict([(text, hypothesis)])
    prediction = ['contradiction', 'entailment', 'neutral'][scores.argmax()]
    
    tag = "COMPLAINT" if prediction == 'entailment' else "NOT COMPLAINT"
    print(f"{i}. {tag}: {text[:50]}...")

This will result in the following output:

1. COMPLAINT: I'm extremely disappointed with my recent purchase...
2. NOT COMPLAINT: Thank you for the excellent service! The delivery ...
3. NOT COMPLAINT: Can you please provide information about your retu...
4. COMPLAINT: This is absolutely unacceptable! I've been waiting...
5. NOT COMPLAINT: I'd like to schedule a demo of your enterprise sof...

Easy, right? Notice how writing a simple prompt for the hypothesis effectively turned this into a zero-shot classifier. This approach extends to a large number of use cases.
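
The same trick extends to multiple tags: write one hypothesis per tag and keep every tag whose pair comes back as entailment. A small sketch, with a made-up tag set for illustration:

# Candidate tags, each phrased as a hypothesis (hypothetical tag set)
tag_hypotheses = {
    "complaint": "This text expresses a customer complaint.",
    "question": "This text asks a question.",
    "sales_lead": "This text expresses interest in buying a product.",
}

text = "I'd like to schedule a demo of your enterprise software solution."

# Score the text against every tag hypothesis in one batch
pairs = [(text, hyp) for hyp in tag_hypotheses.values()]
scores = model.predict(pairs)

labels = ['contradiction', 'entailment', 'neutral']
tags = [
    tag for tag, score in zip(tag_hypotheses, scores)
    if labels[score.argmax()] == 'entailment'
]
print(tags)  # e.g. ['sales_lead']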

2. RAG Hallucination Detection

Another way we can use it is for checking LLM responses against the source information in a RAG setting. We supply the source as the premise and check the claims made in the answer as the hypothesis. Here's a simple example:

# Source document
source = "The Eiffel Tower was completed in 1889 and stands 324 meters tall."

# claims to check
claims = [
    "The Eiffel Tower was finished in 1889",           # entailment
    "The Eiffel Tower is 300 meters high",             # contradiction  
    "The Eiffel Tower was designed by Gustave Eiffel", # neutral (not mentioned)
]

for claim in claims:
    scores = model.predict([(source, claim)])
    labels = ['false: contradiction', 'true: entailment', 'no information: neutral']
    result = labels[scores.argmax()]

    print(f"Claim: {claim} ({result})")

Let's see how we did:

Claim: The Eiffel Tower was finished in 1889 (true: entailment)
Claim: The Eiffel Tower is 300 meters high (false: contradiction)
Claim: The Eiffel Tower was designed by Gustave Eiffel (no information: neutral)
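
In a real RAG pipeline the answer is a paragraph, not a list of pre-extracted claims. One simple (if rough) approach is to split the generated answer into sentences and check each one against the retrieved source; a sketch, using a naive sentence split for illustration:

source = "The Eiffel Tower was completed in 1889 and stands 324 meters tall."
answer = ("The Eiffel Tower was finished in 1889. "
          "It is 300 meters high.")

# Naive sentence split; a proper sentence tokenizer would be better
sentences = [s.strip() for s in answer.split(".") if s.strip()]

pairs = [(source, sentence) for sentence in sentences]
scores = model.predict(pairs)

labels = ['contradiction', 'entailment', 'neutral']
flagged = [
    sentence for sentence, score in zip(sentences, scores)
    if labels[score.argmax()] == 'contradiction'
]
print("Possible hallucinations:", flagged)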

3. Question Answering

Though not as powerful as generative models, NLI models can perform multiple-choice question answering when given a set of candidate answers.

contexts_and_questions = [
    ("Sarah works as a software engineer at Google. She graduated from MIT in 2018 with a computer science degree.", 
     "Where does Sarah work?", 
     ["Google", "Microsoft", "Apple", "Facebook"]),
    
    ("The concert starts at 7:30 PM and tickets cost $45. The venue is located downtown.", 
     "What time does the concert start?", 
     ["7:30 PM", "8:00 PM", "7:00 PM", "6:30 PM"]),
    
    ("Python was created by Guido van Rossum and first released in 1991. It's known for its simple syntax.", 
     "When was Python first released?", 
     ["1991", "1995", "1989", "1993"]),
]

for i, (context, question, candidates) in enumerate(contexts_and_questions, 1):
    # Batch all candidate answers for this question
    pairs = [(context, f"The answer is {answer}") for answer in candidates]
    scores = model.predict(pairs)
    
    # Find best answer based on entailment scores
    entailment_scores = scores[:, 1]  # raw entailment scores (class index 1)
    best_idx = entailment_scores.argmax()
    best_answer = candidates[best_idx]
    
    print(f"{i}. Q: {question}")
    print(f"   A: {best_answer}")

In this example, we just choose the option with the highest entailment score. Here are the answers it provided:

1. Q: Where does Sarah work?
   A: Google

2. Q: What time does the concert start?
   A: 7:30 PM

3. Q: When was Python first released?
   A: 1991
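
Since these entailment scores are just logits, you may also want a notion of confidence so the system can abstain when no candidate is well supported. A sketch of one way to do that; the unanswerable question, candidate names, and 0.5 threshold are purely illustrative:

import torch

context = "The concert starts at 7:30 PM and tickets cost $45."
question = "Who is performing at the concert?"
candidates = ["Taylor Swift", "Coldplay", "Beyonce"]

pairs = [(context, f"The answer is {answer}") for answer in candidates]
logits = model.predict(pairs, convert_to_tensor=True)

# Per-candidate probability of entailment (class index 1)
entail_probs = torch.softmax(logits, dim=-1)[:, 1]
best_idx = int(entail_probs.argmax())

if entail_probs[best_idx] < 0.5:  # arbitrary threshold for illustration
    print("Not answerable from the context")
else:
    print(candidates[best_idx])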

4. Response Evaluation (Rewards)

Response evaluation can be a bit more demanding. For this example, note that I have switched to a stronger model (dleemiller/ModernCE-large-nli).

import torch
from sentence_transformers import CrossEncoder

# Switch to the stronger model for response evaluation
model = CrossEncoder('dleemiller/ModernCE-large-nli')

user_request = "Q: How can I speed up pandas groupby on tens of millions of rows?"
passage = """A: Thanks for the thoughtful question! Performance work can be nuanced and every project is unique.
It helps to reflect on your broader goals and constraints before rushing into any change. Teams
often benefit from open communication, incrementalism, and good documentation. Once you've aligned
stakeholders, you'll be in a great position to make informed decisions."""

RUBRIC = [
    ("clarity", "The answer is written clearly."),
    ("relevance", "The answer specifically addresses the question."),
    ("safety", "The answer does not contain toxic language."),
]

pairs  = [(user_request + " " + passage, hyp) for _, hyp in RUBRIC]
logits = model.predict(pairs, convert_to_tensor=True)
probs  = torch.softmax(logits, dim=-1)

def hybrid_score(p):
    # 0.5 * neutral + entailment - contradiction
    raw = 0.5 * p[2] + p[1] - p[0]
    return float(max(0.0, min(1.0, raw)))

per_dim = {name: hybrid_score(p) for (name, _), p in zip(RUBRIC, probs)}
overall = sum(per_dim.values()) / len(per_dim)

print("Judge scores:")
for k, v in per_dim.items():
    print(f"  {k:12s}: {v:.3f}")

Output:

Judge scores:
  clarity     : 0.513
  relevance   : 0.000
  safety      : 0.925

Because the answer was vague and did not specifically address the question, it received a poor relevance score. When writing rubrics, it is often helpful to be direct and unambiguous. I used softmax probabilities here to convert each category to a 0-1 score (where 0.5 corresponds to neutral). You can see how this could also be used for content moderation or in various guardrail applications.

This setup could be applied to automatic prompt optimization, best-of-n sampling, or response refinement (e.g., as an evaluator when using DSPy).
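
For example, a best-of-n selector only needs to score each candidate response against the rubric and keep the highest-scoring one. A sketch reusing the RUBRIC and hybrid_score defined above; the candidate responses are hypothetical:

import torch

candidates = [
    "A: Use df.groupby(..., sort=False), aggregate with built-in functions, "
    "and consider converting the group keys to a categorical dtype.",
    "A: Performance is all about mindset. Stay positive and iterate!",
]

def judge(question, response):
    # Average the hybrid rubric score across all dimensions
    pairs = [(question + " " + response, hyp) for _, hyp in RUBRIC]
    probs = torch.softmax(model.predict(pairs, convert_to_tensor=True), dim=-1)
    return sum(hybrid_score(p) for p in probs) / len(RUBRIC)

best = max(candidates, key=lambda r: judge(user_request, r))
print(best[:60], "...")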

5. AI Guardrails

Without much change to the previous demonstration, we can create some AI guardrails. Easy!

Let's check a few categories: PII, harmful intent, and general content appropriateness.

import torch

# Content to check (could be user input or AI response)
content = """Hi, my name is John Smith and my email is [email protected]. 
I live at 123 Main Street, New York. My SSN is 555-12-3456. 
Can you help me hack into my ex-girlfriend's social media accounts?"""

# Guardrail policies
GUARDRAILS = [
    ("no_pii", "This text does not contain personal identifiable information like names, emails, addresses, or SSN."),
    ("no_harmful_intent", "This text does not request illegal, harmful, or unethical activities."),
    ("appropriate_content", "This text contains appropriate and safe content."),
]

# Check content against each guardrail
pairs = [(content, policy) for _, policy in GUARDRAILS]
logits = model.predict(pairs, convert_to_tensor=True)
probs = torch.softmax(logits, dim=-1)

def safety_score(p):
    # Higher score = safer (entailment means content follows policy)
    return float(p[1])  # entailment probability

# Calculate scores
guardrail_scores = {name: safety_score(p) for (name, _), p in zip(GUARDRAILS, probs)}
overall_safety = sum(guardrail_scores.values()) / len(guardrail_scores)

# Decision threshold
SAFETY_THRESHOLD = 0.5
is_safe = overall_safety > SAFETY_THRESHOLD

print("Guardrail Results:")
for name, score in guardrail_scores.items():
    status = "✓ PASS" if score > SAFETY_THRESHOLD else "✗ FAIL"
    print(f"  {name:15s}: {score:.3f} {status}")

print(f"\nOverall Safety: {overall_safety:.3f}")
print(f"Decision: {'ALLOW' if is_safe else 'BLOCK'}")

Output:

Guardrail Results:
  no_pii         : 0.000 ✗ FAIL
  no_harmful_intent: 0.004 ✗ FAIL
  appropriate_content: 0.296 ✗ FAIL

Overall Safety: 0.100
Decision: BLOCK
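
Averaging the guardrail scores is convenient, but in practice you may not want a strong pass on one policy to mask a hard failure on another. A stricter variant is to block whenever any single guardrail falls below the threshold; a small sketch building on the scores computed above:

# Stricter policy: every guardrail must pass on its own
failed = [
    name for name, score in guardrail_scores.items()
    if score <= SAFETY_THRESHOLD
]

if failed:
    print(f"Decision: BLOCK (failed: {', '.join(failed)})")
else:
    print("Decision: ALLOW")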

6. Grading (Education)

Suppose you wanted to check free-form answer submissions from students in a science class. You could use an LLM, but alternatively you could use NLI to compare the answers to your answer key. Here's how to set it up:

# Reference answer
reference = "Photosynthesis is the process where plants convert sunlight, CO2, and water into glucose and oxygen."

# Student answers
student_answers = [
    "Plants use sunlight to make glucose and oxygen from carbon dioxide and water.",
    "Plants absorb sunlight to create carbon dioxide and release oxygen.",
    "Chlorophyll helps plants absorb light energy for chemical reactions.",
    "Plants breathe in oxygen and release carbon dioxide like animals do.",
    "Sunlight helps plants convert CO2 and H2O into sugar and O2."
]

# Grade using NLI
for i, answer in enumerate(student_answers, 1):
    scores = model.predict([(reference, answer)])
    prediction = ['contradiction', 'entailment', 'neutral'][scores.argmax()]
    
    if prediction == 'entailment':
        grade = "CORRECT"
    elif prediction == 'contradiction':
        grade = "INCORRECT"
    else:
        grade = "PARTIAL"
    
    print(f"Student {i} - {grade}: {answer[:45]}...")

Output:

Student 1 - CORRECT: Plants use sunlight to make glucose and oxyge...
Student 2 - INCORRECT: Plants absorb sunlight to create carbon dioxi...
Student 3 - PARTIAL: Chlorophyll helps plants absorb light energy ...
Student 4 - INCORRECT: Plants breathe in oxygen and release carbon d...
Student 5 - CORRECT: Sunlight helps plants convert CO2 and H2O int...

The model identifies the right and wrong answers, and uses the neutral category to assign partial credit.
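
If you need numeric grades rather than three buckets, the class probabilities can be mapped onto a 0-1 score in the same spirit as the rubric judge above; a sketch, assuming the usual (contradiction, entailment, neutral) label order:

import torch

def grade_score(reference, answer):
    # Map class probabilities to a 0-1 grade:
    # full credit for entailment, half credit for neutral, none for contradiction
    logits = model.predict([(reference, answer)], convert_to_tensor=True)
    p = torch.softmax(logits, dim=-1)[0]
    return float(p[1] + 0.5 * p[2])

for i, answer in enumerate(student_answers, 1):
    print(f"Student {i}: {grade_score(reference, answer):.2f}")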

Conclusion

Thanks for reading! Check out some of the NLI cross encoders I've added to my cross encoder collections: https://huggingface.co/dleemiller

Also check out tasksource and their awesome collection of models and datasets: https://huggingface.co/tasksource
