Towards sustainable technology: “green” approaches to NLP

We are in a climate crisis caused by humans emitting too much carbon into the atmosphere. There are many factors contributing to this issue. One thing is clear…each person, community, and company can help lessen this impact.

Right now a “bigger is better” mindset pervades the AI community – even when bigger models don’t lead to better results – and the technology’s carbon emissions are seldom calculated and often hidden. A 2019 study estimated that the carbon footprint of training a single Large Language Model is equivalent to 125 round-trip flights between New York and Beijing.

In this session we will review the performance results obtained from lightweight models used to analyze legal contracts, in particular the excellent results obtained with the hybrid AI approach of expert.ai. We’ll see how high levels of accuracy can be obtained using lightweight models while reducing energy consumption and the resulting pollution by about 100 times in the training phase and by about four times in the prediction phase.

Transcript:

Brian Munz:

Hey everyone and welcome once again to the NLP Stream, which is a weekly-ish live stream about all things NLP. Weekly-ish because sometimes we skip a week here and there, which we did last week. My name is Brian Munz and I work for expert.ai as a product manager for the APIs. Every week we try to have on people who are well-versed in the world of NLP and AI to dive into topics that are relevant, so we always have new content exploring this whole world. And if you’ve come and seen us in the past, you will probably recognize the person who’s speaking today, because I think this maybe is your third or fourth. Third.

Samuel Algherini:

Third.

Brian Munz:

So you’re in the three-timers club, which is an exclusive club. I think it’s you, and there may be another one. But yeah, so this week it’s a really interesting topic, something that I’m always very interested in, around the environmental impacts of AI, especially around NLP. So, thanks for joining and take it away, Samuel.

Samuel Algherini:

Hi Brian, hi everyone. Thanks a lot for the introduction. Well as the first thing, do you hear me clearly? Yeah?

Brian Munz:

Yep.

Samuel Algherini:

Yeah, cool. So let me say that I’m very happy to be here and to have the opportunity to talk about this very interesting topic, which I think has been overlooked too many times. Today we try to answer the question: how much more do the AI models that we see today pollute compared with lighter models? I mean, everyone knows that AI pollutes, but no one really knows exactly how much; I haven’t seen a lot of comparisons in the literature. So I would like to say a couple of things. One is that we have written an article, me and Leonardo Rigutini, whom I would like to thank because I’ve learned a lot from him. And you can find the things that I’m going to talk about in this speech in the article, which you will find in the description.

Samuel Algherini:

And yeah, let’s see the experiment and let’s see how much the AI models that we have today pollute compared with other, lighter models. Let’s see this comparison. Give me a few seconds to share the screen. Okay, one second.

Samuel Algherini:

Brian, tell me.

Brian Munz:

You’ve got it.

Samuel Algherini:

Okay, yeah. Do you see the screen?

Brian Munz:

Yep.

Samuel Algherini:

Okay, wow. Okay. I would like to say something about why I think that sustainability, even though a lot of people talk about it, is overlooked and people are not really doing a lot about it. It is because, of course, it is normal that a lot of people look at metrics. When we are dealing with AI, with models, we are asking: how good is that model in terms of performance? Is it performing well? What about the precision? What about the recall? What about accuracy? What about F1 score? All these metrics are necessary and important because you want to get an idea of how the model is performing and how good it is at doing something.

Samuel Algherini:

And so when you have high scores on these things, you of course are dancing and you’re happy. But the question is, what is behind all these metrics and how did you get there? Because there are a lot of issues that a lot of people are not aware of. For example, an AI model needs a lot of data. If you do not provide data that represents the things you want to model, if you provide data that is full of noise and not representative, you are not going to create a good model. So if you want to create a good model, you need a lot of data, and data of high quality. Another example is explainability. It doesn’t really make sense to have high accuracy if you’re looking for a model that is explainable but the model itself is not explainable.

Samuel Algherini:

Let’s say, for example, that you’re dealing with healthcare and medical treatment and there is an algorithm that gives you an output and it denies the treatment to you, or a bank loan, and you want to know why you cannot get it. I think you’re not satisfied if the answer you get is that there are some features in the model that play an important role in the outcome. I think that’s not enough; you want to know why you cannot get that treatment, for example. So it doesn’t really make sense to look only at precision, recall, and accuracy if you are looking, for example, for a model that should provide explainability.

Samuel Algherini:

So when you look at metrics, metrics are very important, of course performance is very important, but you should also see what this entails, what is tied to these kinds of results. There are cases of data shift, cases of computational cost; there are a lot of things in this AI world that we must consider. And yes, biased datasets. And today we are going to add one more, which is environmental sustainability.

Samuel Algherini:

One might ask, but why should we raise an environmental issue around AI? The answer is the following. In recent years, and we’re not talking about 20 years or decades but a few years, there has been a diffusion of what the literature calls large language models. We are referring to models that consist of millions and especially billions of parameters. This is the reason why they are called large language models. And to give you an idea, here is the record growth, because in a few years this number has increased exponentially, and all these parameters are not running for free. They have a computational cost and they consume energy: there is a series of GPUs working to make these models run.

Samuel Algherini:

As you can see in this picture, it happened in very few years; we’re talking about three years. At the end of 2018 we have BERT with 340 million parameters. And if we look at the end of 2021, in just three years we have gone from a few hundred million to a few hundred billion parameters. So it’s not something that has been happening for decades. Well, of course models were growing over these decades, but in these last few years there’s been an explosion in the number of parameters. And that’s the reason why new questions must be raised, namely, how much are they polluting?

Samuel Algherini:

So two things, two questions. One is, how much more do large language models pollute than light hybrid machine learning models? I will explain later what I mean by light hybrid machine learning models. So we have this question that is related to the environment: how much more do they pollute? The other question is related to performance: is it possible to have models that do not pollute so much, that do not consume so much energy, but at the same time are able to provide good performance?

Samuel Algherini:

So in this case we have run an experiment and we have tried to see if this was possible. Of course, it is one experiment; we are doing more. But let’s see what happens in this case. One thing that we would like to add is that these huge models are pre-trained for general purposes. But in business, every case has a specific domain of application. So if you are dealing with healthcare, you have your own kind of data. The same is true for finance, insurance, legal: everyone has their own specific kind of data to work with.

Samuel Algherini:

So let’s see what happens when we are dealing with a vertical domain, and for this case we have chosen the legal domain. The experiment concerns a classification task. A classification task means that we are creating a model that is able to categorize, to classify: to receive an input and categorize it as belonging to one of the categories of the taxonomy. It means that there are a lot of categories that the algorithm has been trained with, and the model, once it has been trained, must be able to understand whether a future input belongs to one of the categories in the taxonomy.

Samuel Algherini:

In this classification task we are dealing with the legal domain and we have taken the LexGLUE benchmark, a famous benchmark in the literature. It is a collection of seven datasets and the one that we focus on is LEDGAR. More specifically, it consists of 80,000 clauses extracted from contracts. The contracts have something between 100 and 300 clauses, so there is an average of 200 clauses per contract, and the legal sources are from the US Securities and Exchange Commission. This dataset has a taxonomy of 100 categories. It means that we are dealing with a dataset of 80,000 clauses and 100 categories.

Samuel Algherini:

The model trained here should recognize, given these clauses, whether they belong to one of these one hundred categories. And the split, following the literature, is 60,000 for training, 10,000 for validation, and 10,000 for the test set. So this is the dataset that we have used and the experiment that we reproduce. But which kinds of models do we use for this experiment? Well, with regard to large language models, we picked BERT, which is a very famous and widely used large language model.
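
As an aside for readers, the 60,000 / 10,000 / 10,000 split mentioned above can be pictured with a short sketch. It assumes one had to cut the 80,000 clauses by hand with a seeded shuffle; the experiment itself follows the split used in the literature, so the function name `split_dataset` and the integer ids below are purely illustrative.

```python
import random


def split_dataset(items, n_train, n_val, n_test, seed=0):
    """Shuffle once with a fixed seed, then cut into three disjoint parts."""
    items = list(items)
    if n_train + n_val + n_test != len(items):
        raise ValueError("split sizes must add up to the dataset size")
    random.Random(seed).shuffle(items)
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test


# Placeholder ids standing in for the 80,000 clauses (not the real data).
clause_ids = range(80_000)
train, val, test = split_dataset(clause_ids, 60_000, 10_000, 10_000)
print(len(train), len(val), len(test))  # 60000 10000 10000
```

Keeping the validation and test parts disjoint from the training part is what makes the later performance comparison between models meaningful.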

Samuel Algherini:

LegalBERT is a model that has been pre-trained on the legal domain, so we already have a vertical pre-trained model. And then there are also spaCy bag-of-words and spaCy CNN, convolutional neural network, which are open-source neural networks that you can use and fine-tune for your purposes. So here we have these four models, which are the four that belong to the large language models. As a light model we use, let’s say, a simple support vector machine, a very good but classic algorithm, which is provided with vectors that come from bag-of-words and TF-IDF, which are classic techniques. Last but not least, of course, we have the expert.ai hybrid model and I would like to spend a couple of minutes on it. I will not go into deep detail because I refer you to the article, and I suggest you read it.

Samuel Algherini:

You will find more details there, but I would like to give you the idea of the expert.ai hybrid approach because, as you can see here, we have an SVM, which is a machine learning algorithm, but there is also this symbolic analysis, which is so interesting and which makes this hybrid in some sense. Using symbolic analysis means that it leverages the knowledge graph and the word disambiguation engine. What are we talking about? The knowledge graph is a huge semantic net, a huge representation of the world. Try to imagine that there are many nodes and every node might be a concept, an event, an entity. And all these concepts, events, and entities are related to each other through links, through arcs. Each arc specifies a relationship. So we have this huge representation of the world that is done through symbols, and it is fundamental in helping the word disambiguation engine understand the meaning of something, especially when we are dealing with cases of ambiguity.

Samuel Algherini:

And so the idea is that we can provide the support vector machine with vectors that use this knowledge of the world, which comes from our understanding of the world, because we have built this knowledge graph over decades. So, very briefly, the pipeline is: once you extract the linguistic information, you represent this data symbolically thanks to the knowledge graph and the word disambiguation engine. Then you extract the n-grams, and what is interesting is that you vectorize through standard techniques, like bag-of-words and TF-IDF, term frequency-inverse document frequency. But the data that you are vectorizing should carry with it the quality, the information that comes from the symbolic representation of the word that comes from the knowledge graph and [inaudible 00:16:20]. So thanks to this, the information that we are providing to the support vector machine should be valuable.
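
To make the vectorization step more concrete, here is a minimal sketch of n-gram extraction followed by TF-IDF weighting, written in plain Python. It is not the expert.ai pipeline: in the hybrid approach the tokens would first be replaced by symbols coming from the knowledge graph and the disambiguation engine, whereas here plain lowercase words stand in for them. The function names and the two example clauses are assumptions made for illustration only.

```python
import math
from collections import Counter


def ngrams(tokens, n_max=2):
    """Return all word n-grams of length 1..n_max as space-joined strings."""
    out = []
    for n in range(1, n_max + 1):
        out.extend(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return out


def tfidf_vectors(documents, n_max=2):
    """Bag-of-n-grams term frequencies weighted by inverse document frequency."""
    bags = [Counter(ngrams(doc.lower().split(), n_max)) for doc in documents]
    # Iterating a Counter yields each distinct term once, so this counts documents.
    doc_freq = Counter(term for bag in bags for term in bag)
    n_docs = len(documents)
    vectors = []
    for bag in bags:
        total = sum(bag.values())
        vectors.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in bag.items()
        })
    return vectors


# Two made-up clauses, used only to show the weighting.
clauses = [
    "this agreement is governed by the laws of new york",
    "this agreement may be terminated by either party",
]
vectors = tfidf_vectors(clauses)
print(vectors[0]["this agreement"])  # 0.0, the bigram occurs in every clause
print(vectors[0]["governed"] > 0)    # True, the word is specific to one clause
```

Sparse vectors of this kind are what a linear classifier such as a support vector machine takes as input; terms shared by every clause get zero weight, while terms that single out a clause type keep a positive one.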

Samuel Algherini:

So these are the phases of the development of the hybrid model: linguistic analysis, symbolic representation, the creation of the language model and the vectorization of that knowledge, and then the training. And just a couple of things before the drum roll and we see the results of the experiment. We use a library called codecarbon to see how much energy has been consumed for this experiment. And a couple of things about hardware configuration: huge models need computational power. So for the simple SVM and for our hybrid approach, we use a CPU, so you can run them on your laptop easily. Instead, for the BERT approach, for the large language models, we use a Tesla T4 GPU.
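
For readers who want to try this kind of measurement, the sketch below shows one way to wrap a workload with codecarbon’s `EmissionsTracker`, whose `start()` and `stop()` methods bracket the code being measured and whose `stop()` returns the estimated emissions in kg of CO2-equivalent. The wrapper `run_tracked`, the project name, and the toy workload are assumptions for illustration and not the setup used in the experiment; if codecarbon is not installed, the sketch falls back to timing only.

```python
import time

try:
    from codecarbon import EmissionsTracker  # third-party: pip install codecarbon
except ImportError:
    EmissionsTracker = None  # fall back to wall-clock timing only


def run_tracked(task, project_name="nlp-experiment"):
    """Run task() and return (result, seconds, kg_co2eq or None)."""
    tracker = EmissionsTracker(project_name=project_name) if EmissionsTracker else None
    started = time.perf_counter()
    if tracker:
        tracker.start()
    try:
        result = task()
    finally:
        # stop() ends the measurement even if the task raised an exception.
        emissions = tracker.stop() if tracker else None
    seconds = time.perf_counter() - started
    return result, seconds, emissions


def toy_training():
    # Placeholder workload standing in for model training or prediction.
    return sum(i * i for i in range(100_000))


result, seconds, emissions = run_tracked(toy_training)
print(f"finished in {seconds:.3f}s, estimated emissions: {emissions}")
```

Wrapping training and prediction separately in this way is what allows the two phases to be compared on their own.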

Samuel Algherini:

And so let’s see the results. Let’s look at performance first, before we look at the environment, even though that is the most important thing today. I don’t know if it’s surprising for you, it was for me, because we see that there are very small differences between the simple SVM model, the expert.ai hybrid model, and LegalBERT. Okay, one second. Let me briefly recap this table to make this clear. Here we’re looking at micro F-measure and macro F-measure. One thing to keep in mind about these large language models: generally you can find them freely, they are open source, you can use them, but you need to fine-tune them. It means that you need to train them properly with your data.
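
Since the results are reported as micro and macro F-measure, a small sketch of the difference may help. Micro-averaging pools true positives, false positives, and false negatives over all categories before computing F1, so frequent categories dominate; macro-averaging computes F1 per category and takes the plain mean, so rare categories count as much as common ones. The labels below are made up, and the helper names are illustrative.

```python
from collections import Counter


def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(y_true, y_pred):
    """Return (micro F1, macro F1) for single-label classification."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted category gets a false positive
            fn[truth] += 1  # true category gets a false negative
    labels = set(y_true) | set(y_pred)
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro


# Made-up labels: a frequent category and a rare one.
y_true = ["governing_law"] * 8 + ["notices"] * 2
y_pred = ["governing_law"] * 8 + ["governing_law", "notices"]
micro, macro = micro_macro_f1(y_true, y_pred)
print(round(micro, 3), round(macro, 3))  # 0.9 0.804
```

With 100 categories of uneven size, a gap between the two numbers usually signals that the rarer categories are harder for the model.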

Samuel Algherini:

If you do not do that, as you can see, the results are not very good, I would say. We can see that BERT has good results, LegalBERT good, bag-of-words and CNN have some results but not very good ones, and we see that a simple light model like SVM, and the expert.ai hybrid model, which is a bit more complicated but light compared to LegalBERT, are doing very well. The hybrid approach has the same micro and even a slightly better macro. So wow, that’s very interesting. The performance is comparable, essentially the same: without billions of parameters, we have the same results. This is with regard to performance, but let’s see what happens with the main topic of today, which is environment and sustainability. How much more do they consume?

Samuel Algherini:

Okay, well, we have this table and you should focus on these two columns. If we take BERT as the reference, so we express BERT as 100, we can see that SVM and the expert.ai hybrid model are much cheaper. We can see that BERT is consuming 100 times more than expert.ai and SVM. 100 times more. One thing I should add here is that this table refers to the experimental phase. It means that the data you see here covers the training, which for the large language models would be fine-tuning, plus the validation and the test. So it means all the time needed for the training and the validation with the 60,000 and 10,000 documents, and the test with the 10,000 documents. So this is the most important part. It is the part where you consume the most energy, but that you need to do anyway in order to fine-tune your model.
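
The “BERT as 100” normalization can be reproduced with a few lines. The kWh values below are hypothetical placeholders chosen only so that the arithmetic is easy to follow; they are not the measurements from the table, and the function name is an assumption.

```python
def relative_index(measurements_kwh, reference="BERT"):
    """Express every measurement relative to the reference model, set to 100."""
    ref = measurements_kwh[reference]
    return {name: 100.0 * kwh / ref for name, kwh in measurements_kwh.items()}


# Hypothetical energy figures in kWh (NOT the values measured in the experiment).
measurements = {"BERT": 2.0, "light model": 0.02}
index = relative_index(measurements)
for name, value in index.items():
    print(f"{name}: {value:.1f}")
```

On this scale, a model that scores about 1 consumes roughly one hundredth of the reference, which is what “100 times more” for BERT means.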

Samuel Algherini:

You don’t do it many times: you do it once, then you do it again when you need to retrain your model. But in this case, when you train your model, when you fine-tune your model, if you use a large language model, in this case BERT and LegalBERT, you are consuming 100 times more than with a hybrid model or a light model like SVM. So the difference in kilowatt-hours is incredible, and the same goes for CO2, of course. So the emissions, too, are much higher for BERT and LegalBERT. This is with regard to the experimental phase.

Samuel Algherini:

But nonetheless, let’s see what happens with the test phase. I mean the prediction phase, sorry. In the prediction phase we are not dealing with the training, because now you have your model created; now we are dealing with the moment when you provide a document and the model predicts whether the document belongs to one of the categories in the taxonomy. In this case, this table refers to the results for 50 legal contracts. Every contract has something between 100 and 300 legal clauses, so we consider an average of 200 legal clauses. With 200 clauses per contract and 50 contracts, we have approximately 10,000 clauses. In this case we can see that the BERT family, BERT and LegalBERT, is consuming four times more than the hybrid expert.ai model and the light support vector machine model.

Samuel Algherini:

So it is not 100 times as in the experimental phase, which is when you train the model; in the prediction phase they are four times more expensive than expert.ai and the simple approach. Of course it is not as astonishing as 100, but it’s still an interesting number, especially at this time when energy is not so cheap and the planet really needs our attention. These large language models are consuming four times more energy than the support vector machine and the hybrid approach.
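
The arithmetic behind the prediction-phase figures can be written out as a short sketch. It uses only numbers stated in the talk (50 contracts, 100 to 300 clauses each, a factor of four); the function names are illustrative. It also shows how “four times more” for the large model translates into a share and a saving for the light one.

```python
def estimated_clauses(n_contracts, min_clauses, max_clauses):
    """Estimate the total number of clauses from the midpoint of the range."""
    return n_contracts * (min_clauses + max_clauses) // 2


def share_and_saving(times_more):
    """If the heavy model uses N times more energy, return the light
    model's share of that energy and the corresponding saving."""
    share = 1 / times_more
    return share, 1 - share


print(estimated_clauses(50, 100, 300))  # 10000
print(share_and_saving(4))              # (0.25, 0.75)
```

In other words, a factor of four means the lighter models use about a quarter of the energy in prediction, a saving of about three quarters.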

Samuel Algherini:

What is really interesting in these studies, at least from my personal experience, is that four times in the prediction phase is of course a lot, but still in some sense reasonable. What is really interesting is that in this case, with large language models, we’re not getting better performance. So the only difference is that with a lighter model and with a hybrid model, you are getting a model that is faster in training, is as good as the larger model in predicting, and consumes much less energy, which at this time is important. And a couple of things that I would like to add: of course, this is just one experiment. We need more experiments in different fields, in different domains, in order to understand whether these results are consistent.

Samuel Algherini:

And so we are doing that. So this is a good reason to stay tuned with us, so we can show what kind of results we get from these other experiments. Because in this case, in the legal domain, we have seen with the [inaudible 00:25:39] dataset, comparing different kinds of models, that the light and hybrid expert.ai models are doing as well as the large language models, in our case even slightly better, but essentially the same, while consuming much less energy: 100 times less in the experimental phase and four times less in the prediction phase.

Samuel Algherini:

And this is another good reason why I always say to people: do not fall in love with technology, fall in love with the capability to solve your problem. In this case, if we just do not follow the fad of using these huge language models, we have a clear case where a light model and a hybrid model can perform the same at a smaller cost, and especially with much less energy consumption, which today is very important. I hope that the other experiments that we are running will provide the same results; it’s really a hope, and we’ll see. So stay tuned and yeah, let’s see what happens.

Brian Munz:

Yeah, I think an interesting point that you make is that there’s no magical solution that solves all problems, right? And so while that can be easy, I think this shows that, of course, it makes sense from a business perspective to always consider your use case, but now it also makes sense from an environmental perspective. Because if you need to pave your driveway and you bring in a giant truck full of, this is a terrible analogy, but you use a jackhammer to drive a nail, I don’t know. You’re using too much for a smaller use case; while it may sometimes be easier and the performance may be the same, you are using more resources. So I think that’s an interesting point here. So it looks like we have a few questions here. The first is from Brett Loveday, and this was early on, and he’s asking whether the larger centralized language models like GPT-3 are greener than distributed AI, which I think is probably the case, right?

Samuel Algherini:

Yeah, but I couldn’t say with data. I mean, I would like to have some data in order to tell you how much, but I cannot answer in this way. Sorry.

Brian Munz:

Yeah, it makes sense but I don’t have the actual numbers. Another conversation that came up: Laura Hambridge mentioned that the classification task is far different from the text generation that large language models like BERT and GPT are designed for. And also that the fact that the hybrid model can be run on a CPU does say a lot about its efficiency. So yeah, I think this is speaking to what you mentioned at the end, where more experiments are needed. This is a starting point, a smaller experiment, right?

Samuel Algherini:

Well, what I think is that, yeah, it’s true that text generation is something different, and today we are waiting for GPT-4. So of course there is a different effort behind it. What I would like to highlight is that before starting with a kind of model, we need to understand which kind of task we need to solve. Sometimes I need huge models, sometimes I don’t. But I think that today it’s very easy to start with a huge model when there is no need. In the case of text generation, they probably work much better. But it always depends on what we are looking for and what we are doing. Classification is a different kind of task, and text generation is slightly different. And then it depends on what we are doing, because sometimes even GPT-3 really misunderstands and answers with things that are not very semantically understandable, but…

Brian Munz:

Yep. And another question from Brett. He said he is using Node JS and JavaScript to find patterns in random data, with a large dataset, and asks: will this hybrid model work for me? So I mean initially I would say that we’d have to know more about what you mean by random data, et cetera. But generally I would imagine that using a hybrid model would be more effective than simply Node JS and JavaScript.

Samuel Algherini:

Yeah, well yeah. We have the opportunity to leverage our knowledge graph, which comes from decades of construction, of work on it. So the one you proposed, Brett, is a kind of hybrid model, but I think that the type of hybrid model I’m talking about is slightly different because it leverages the knowledge graph and the [inaudible 00:31:19], which is the core technology, and it is the kind of technology that provides the representation that creates this kind of vectorization. So…

Brian Munz:

Yep. And then he kind of followed that up with a nice segue into wanting some links about the hybrid approach and things like that. So what we can do is post these links on LinkedIn and YouTube, just pointing in the right direction, especially if you want to use expert.ai; we have a kind of free tier you could use to explore it. But yeah, we’ll put some links on the actual video and then reach out directly if need be, so. But thanks for the questions because-

Samuel Algherini:

Thanks a lot.

Brian Munz:

…very good to see that people are engaging. But yeah, Samuel, thanks for this presentation. This is something that I always find extremely interesting and I think it’s a topic that’s good to keep in people’s minds because, as we know, one of the laws of technology is that it seems to double in size every few years. So it’s something that should definitely be at the front of everyone’s minds. So thanks again for presenting, and I look forward to you joining the four-timers club, so.

Samuel Algherini:

Thanks Brian. Thanks everyone. Thanks a lot for your time.

Brian Munz:

Great. So next week presumably we’ll be back with another presentation on technology on small devices. So NLP on, I imagine, IOT and other small devices and applications there. So make sure to tune in same time, same place, same day. So well I’ll see you next week and thanks again.