At Codesphere, we advocate for smaller fine-tuned models that are more efficient and, despite their size, maintain high performance. Models with fewer than 7 billion parameters, often called sub-7B models, are particularly attractive for tasks like document categorization.
Before you judge these models by their size, hear me out. They are highly effective thanks to techniques like model distillation, parameter sharing, and optimized training processes. They offer a good balance between computational efficiency and accuracy, which means just about anyone can run them with limited memory and processing power. Later in the article, I will show how to use both discriminative and generative sub-7B models to classify documents, and we will talk about the best open-source models for the job.
What is document classification?
Before we get to the best document classification models, let's discuss what document classification is. Document classification is a natural language processing (NLP) task that assigns text documents to predefined categories or classes based on their content. It has applications across several industries, such as organizing emails into folders, tagging documents by topic, or classifying customer feedback by sentiment. It relies on algorithms that learn from labeled examples and apply that knowledge to categorize new, unseen documents. Classifying documents effectively helps turn unstructured text into structured information that is easier to manage, analyze, and act on.
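To make "learning from labeled examples" concrete, here is a minimal sketch of a word-count (Naive Bayes) classifier in plain Python. It is not one of the models discussed in this article, just the smallest thing I could think of that learns from labeled text. The function names and the four training sentences are made up for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace; real systems use better tokenizers."""
    return text.lower().split()


def train(labeled_docs):
    """Count words per label from (text, label) pairs."""
    word_counts = defaultdict(Counter)  # label -> word frequencies
    doc_counts = Counter()              # label -> number of documents
    for text, label in labeled_docs:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = {word for counts in word_counts.values() for word in counts}
    return word_counts, doc_counts, vocab


def predict(model, text):
    """Return the label with the highest log-probability for the text."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        # Log prior: how common the label was in the training data.
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            # Add-one smoothing so unseen words do not zero out the score.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up training examples, for illustration only.
examples = [
    ("your loan application has been approved", "Approved Document"),
    ("we are pleased to approve your request", "Approved Document"),
    ("we regret to inform you that your application was rejected", "Rejection Document"),
    ("unfortunately your request has been declined", "Rejection Document"),
]
model = train(examples)
label = predict(model, "we are pleased to approve your loan")
```

The fine-tuned and zero-shot models discussed later in the article do the same job with far richer representations of the text, but the loop is the same: fit on labeled documents, then predict a label for a new one.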
Different Types of Document Classification
Document classification can be of many different types based on the use case.
Topic Classification: It involves assigning documents to categories based on the subject matter they cover, such as sorting news articles into topics like politics, sports, or entertainment.
Sentiment Analysis: This is where you categorize documents according to the sentiment expressed, whether positive, negative, or neutral, which is particularly useful in analyzing customer reviews or social media content.
Intent Classification: Intent classification is used to determine the purpose behind a text, which is crucial in customer service to understand whether a customer is making a complaint, asking a question, or giving feedback.
Entity Extraction: This is where models try to classify and tag entities like names, dates, or locations within documents. This is often used in legal document processing or information extraction from large text corpora.
Why is classification important?
We know what document classification does, but what is it good for? It matters for several reasons, and here are a few of them.
Efficiency: First, it enhances efficiency by automating the organization and retrieval of documents. This in turn reduces the time and effort required to manage large volumes of text manually. This way you ensure that information is always categorized consistently, which is vital for maintaining accuracy across large datasets.
Scalability: It also allows organizations to handle growing amounts of data without increasing their headcount. This is especially important today, when the sheer volume of information makes it hard for organizations to process documents manually.
Decision Making: By structuring data in a way that is easy to analyze, organizations can derive insights more efficiently, which can influence strategic decisions.
Best sub-7B document classification models
When it comes to document classification using AI models, there are two primary approaches: utilizing a discriminative model or a generative model.
1- Use a discriminative model
Discriminative models are typically more direct and efficient, focusing on learning the boundaries between different classes. For pure document categorization with a focus on sub-7B models, MiniLM and DistilBERT are often the best starting points due to their efficiency and solid performance. If you need something even lighter, BERT-tiny or Electra-small can be excellent alternatives, especially for applications where computational resources are very limited.
For demonstration, I am using zero-shot classification with roberta-large-mnli to sort bank documents into four predefined categories. With some prompt engineering, these models should work as well as LLMs or even better. In my own experiments, I also found them to be faster than LLM-based models. If you want to see how I use roberta-large-mnli for text classification, here is the code.
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from transformers import pipeline
from dotenv import load_dotenv

# Load environment variables from a local .env file, if present.
load_dotenv()

app = FastAPI()

# NLI model used for zero-shot classification; downloaded on first run.
model_id = "roberta-large-mnli"
classifier = pipeline("zero-shot-classification", model=model_id)

# The four categories a bank document can fall into.
candidate_labels = ["Approved Document", "Application Document", "Rejection Document", "Missing Documents"]


class TextData(BaseModel):
    texts: str


@app.post("/classify")
async def classify(text_data: TextData):
    try:
        # The pipeline returns labels sorted by score, best first.
        result = classifier(text_data.texts, candidate_labels)
        return {"response": result["labels"][0]}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))


if __name__ == "__main__":
    import uvicorn

    uvicorn.run(app, host="0.0.0.0", port=3000)
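The endpoint above always returns the first entry of `labels`, which works because the zero-shot pipeline returns a dictionary with `sequence`, `labels`, and `scores`, sorted best first. If you would rather not force a category on documents the model is unsure about, a small helper like the sketch below can help. The name `pick_label`, the 0.5 threshold, and the "Uncertain" fallback are my own assumptions, not part of the transformers API, and the sample result is made up.

```python
def pick_label(result, threshold=0.5, fallback="Uncertain"):
    """Return the top label if its score clears the threshold, else the fallback.

    `result` is expected to have the shape of the zero-shot pipeline output:
    {"sequence": str, "labels": [...], "scores": [...]}, sorted best first.
    """
    top_label = result["labels"][0]
    top_score = result["scores"][0]
    return top_label if top_score >= threshold else fallback


# Made-up result in the pipeline's output shape, for illustration only.
sample = {
    "sequence": "Your loan has been approved.",
    "labels": ["Approved Document", "Application Document",
               "Missing Documents", "Rejection Document"],
    "scores": [0.91, 0.05, 0.03, 0.01],
}
chosen = pick_label(sample)
```

As for prompt engineering with this kind of model, the closest equivalent is the pipeline's `hypothesis_template` argument (for example `"This document is a {}."`), which controls the sentence each candidate label is inserted into before the NLI model scores it.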
2- A little prompt engineering with a Generative Model
The other way to classify documents is to use a generative model. With a little prompt engineering, you get a model that classifies documents into the given categories pretty accurately. I used a ready-to-use llama.cpp template we have on Codesphere. It took me some time to get the prompt right, but once that was done, the LLM classified the text as accurately as our discriminative model, if not better.
You can try our Llama template with a single click, and here is the exact prompt that gave me the best results.
This is a conversation between User and Llama, a friendly chatbot that helps user classify documents as one of these four categories
1. Approved Document
2. Application Document
3. Rejection Document
4. Missing documents
Your answers should be just one of the four document types mentioned above and nothing else.
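Even with an instruction like this, a generative model can wrap its answer in extra words, so it helps to build the prompt in one place and to map the reply back to a known category. The sketch below shows one way to do that. The helper names and the prompt wording in `build_prompt` are my own simplified assumptions, not the exact prompt above, and no model is called here.

```python
CATEGORIES = [
    "Approved Document",
    "Application Document",
    "Rejection Document",
    "Missing Documents",
]


def build_prompt(document, categories=CATEGORIES):
    """Assemble a classification prompt: instructions, numbered labels, document."""
    numbered = "\n".join(f"{i}. {name}" for i, name in enumerate(categories, 1))
    return (
        "Classify the document as one of these categories:\n"
        f"{numbered}\n"
        "Answer with just one of the category names and nothing else.\n\n"
        f"Document: {document}\n"
        "Category:"
    )


def parse_reply(reply, categories=CATEGORIES, fallback=None):
    """Map a free-text model reply to a known category, ignoring case."""
    lowered = reply.lower()
    matches = [name for name in categories if name.lower() in lowered]
    # Accept the reply only when exactly one category is mentioned.
    return matches[0] if len(matches) == 1 else fallback


prompt = build_prompt("We are pleased to inform you that your loan is approved.")
parsed = parse_reply("Sure! This is an Approved Document.")
```

Returning a fallback when zero or several categories appear in the reply keeps ambiguous answers from being silently counted as a valid label.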
I also tried using a GBNF grammar with llama.cpp to constrain the output, which did not really improve the quality of the classification.
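For reference, a grammar for this task can be very small. The sketch below generates a GBNF grammar whose single `root` rule only allows one of the category names. The function name `labels_to_gbnf` is hypothetical, and this is not necessarily the grammar I used in my experiment.

```python
def labels_to_gbnf(labels):
    """Build a GBNF grammar whose root rule matches exactly one of the labels."""

    def quote(label):
        # Escape backslashes and double quotes for a GBNF string literal.
        escaped = label.replace("\\", "\\\\").replace('"', '\\"')
        return f'"{escaped}"'

    alternatives = " | ".join(quote(label) for label in labels)
    return f"root ::= {alternatives}\n"


grammar = labels_to_gbnf([
    "Approved Document",
    "Application Document",
    "Rejection Document",
    "Missing Documents",
])
```

The resulting text can be saved to a file and passed to the llama.cpp CLI with `--grammar-file`. A grammar like this only guarantees the form of the answer, not that the chosen category is correct, which is plausibly why it did not change the quality of the results.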
Wrapping Up
After trying both approaches, I can say that each has its own strengths and suits different tasks and resource environments. The choice between the two depends largely on the requirements of the task at hand, such as the need for speed, accuracy, or the ability to handle ambiguous data. During my little adventure with sub-7B models for text classification, I found discriminative models to be more robust and faster, while with the right prompt the LLM performed as well as the discriminative model.