Sunday, September 11, 2016

The Watson trainer

These days everybody is busy training Pokémon. But, because of a challenge, I had to spend a few hours training something different: Watson. It was real fun, and this is why I decided to share my experience.

Some background on my use case

As I said, I'm working on a challenge that requires extracting knowledge from natural language. I decided to use Watson, and in particular the Alchemy Language service available on IBM Bluemix. The service offers a free trial mode to anyone interested in testing its capabilities, and registering on Bluemix is really simple. So try it!

The Alchemy Language service offers a few different capabilities, and I was particularly interested in the "Relation Extraction" function, which recognizes the entities appearing in a text and the relations between them.

The Alchemy Language service is described at:

But the relations are really domain specific. Take as an example a text describing cars: it could be an article comparing a bunch of the latest sports cars, or a document describing the dynamics of a car accident. In the first case the relations I'm interested in could range from "car A is faster than car B" to "car A consumes less gasoline than car B". In the second case I would rather be interested in relations like "car A hit car B and hit car C", causing the accident.

So, depending on the context and the objective of the text analysis, the sets of relations between entities can be really different.

So why did I have to train Watson?

The Relation Extraction function requires a model that describes the entities and the relations I want Watson to extract from the text. The default model offered by the Alchemy Language service is geared toward extracting entities and relations from English news; you can try this model here:

But I needed a model specifically tailored to software and software-specific relations. And this is why I had to train Watson!

The training process: Watson Knowledge Studio

Watson Knowledge Studio is the tool that allows you to create a customized knowledge model. The tool offers a trial period, which can be activated at:

To start using the tool, I suggest a very nice tutorial, available as a series of YouTube videos:

The tutorial is fast and goes through the whole cycle, from the trial registration up to publishing the model so it can be used with the Alchemy Language service.

To create my model I had to execute some steps that are a good example of the steps required to create any model. The process basically consists of manually annotating some sample documents in order to show Watson what you are interested in.

Creating the basics of the model

The first and most important operation is the definition of the entity types you are interested in and of the relations linking them. I started without having a clear idea of my model, and I had to modify it a few times before it described my domain correctly.
It is basically an iterative process: I created an initial model, then started applying it to some of the sample documents and realized it was not good enough. So I started modifying it, and the more documents I looked at, the more changes I had to apply.
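Just to give an idea of what came out of that iteration, here is a minimal sketch, in Python and purely illustrative, of a small software-oriented type system. Knowledge Studio defines the type system through its web UI, and the entity and relation names below are my own assumptions, not something taken from the product:

    # Purely illustrative sketch of a type system for a software domain.
    # Watson Knowledge Studio lets you define these through its web UI;
    # the entity and relation names below are just assumptions for my use case.

    entity_types = ["Software", "Version", "Vendor", "OperatingSystem"]

    relation_types = {
        # relation name: (entity type of 1st argument, entity type of 2nd argument)
        "runsOn":     ("Software", "OperatingSystem"),
        "dependsOn":  ("Software", "Software"),
        "producedBy": ("Software", "Vendor"),
        "hasVersion": ("Software", "Version"),
    }

    # Quick sanity check: every relation must connect two declared entity types.
    for name, (arg1, arg2) in relation_types.items():
        assert arg1 in entity_types and arg2 in entity_types, name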

Providing one or more input dictionaries

Once an entity type like Software has been created, the next step is to teach Watson how to recognize it. There are basically two ways: the first is to tag all the occurrences of a software product in the sample documents, so that Watson learns from examples. The second is to provide Watson with a dictionary.
The dictionary is basically a list of words that can represent your entity. In the case of software, it could be a list containing many of the existing software products, or at least the ones you are interested in.

A dictionary saves a lot of manual work, because it allows you to leverage a pre-annotator that discovers and annotates all the occurrences of one or more entities in the training documents.
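As a rough idea of how such a dictionary could be assembled, here is a small sketch that writes a few Software surface forms to a CSV file. The exact file layout Knowledge Studio expects may differ, so treat the column names as assumptions and check the product documentation before importing:

    import csv

    # Illustrative only: a tiny dictionary of surface forms for the Software entity.
    # The column names below are assumptions; check the Knowledge Studio docs
    # for the exact CSV layout it expects.
    software_terms = {
        "WebSphere Application Server": ["WebSphere", "WAS"],
        "DB2": ["DB2", "Db2"],
        "Linux": ["Linux", "GNU/Linux"],
    }

    with open("software_dictionary.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["lemma", "surface_forms"])   # assumed header
        for lemma, forms in software_terms.items():
            writer.writerow([lemma, ";".join(forms)])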

In my case it was not possible to provide a dictionary for all the entities. That is not a show stopper anyway: I just had to annotate the missing entities manually.

Performing the manual annotation

This is a tedious phase, even though it is really important to perform it correctly. The first time I went through it, I took some shortcuts, annotating only a subset of the documents and doing it quickly. The resulting model was really poor at extracting relations. I had to process the documents a second time: annotating them accurately took some effort, but the result was worth the work.

So, during this step, you take the pre-annotated documents and mark, inside them, all the entities that the pre-annotator missed. This is also the time to annotate all the relations between entities.

The graphical editor is a great help in performing this work: it highlights the different entity and relation types in different colors and shows graphically, through arrows, which entities are involved in which relations.

The final step is to group together all the mentions that refer to the same entity. Assume for example that two Ferrari 488s are involved in an accident. In the accident report there will be multiple occurrences of the words Ferrari and 488, but which of them refer to the hitting car and which refer to the car that was hit? This step is intended to help Watson disambiguate entities and understand how mentions are grouped together.
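Conceptually, this step builds coreference chains: groups of mentions that all point to the same real-world entity. A toy illustration of the idea (this is not Knowledge Studio output, just how I picture the grouping):

    # Toy illustration of coreference chains for the car accident example:
    # each chain groups the mentions in the text that refer to the same car.
    coreference_chains = {
        "car_1 (the hitting car)": ["a Ferrari 488", "the Ferrari", "it"],
        "car_2 (the car that was hit)": ["another 488", "the second car"],
    }

    for entity, mentions in coreference_chains.items():
        print(entity, "->", mentions)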

The video tutorial does a great job in describing the manual annotation steps.

The annotation process can be suspended at any time, for example to modify the model. I had to suspend it a few times; once the model is updated, you can update the annotated documents so that they accept the new model, and then resume the annotation process.

The training and publishing phases

Once the documents have been properly annotated, it is possible to submit them. Watson will analyze the annotations and generate a relation extraction model. The model can finally be deployed to the Alchemy Language service, in order to use it with the Relation Extraction function.
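To give an idea of how the published model is then consumed, here is a rough sketch of a call to the Alchemy Language typed relations function with a custom model. The endpoint path and parameter names are written from memory, so double check them against the Alchemy Language API reference; the API key and model id are placeholders:

    import requests

    # Rough sketch, written from memory: check the Alchemy Language API reference
    # for the exact endpoint and parameter names. The API key and model id below
    # are placeholders for your own credentials and the model published from
    # Knowledge Studio.
    ALCHEMY_ENDPOINT = "https://gateway-a.watsonplatform.net/calls/text/TextGetTypedRelations"

    params = {
        "apikey": "YOUR_ALCHEMY_API_KEY",
        "model": "YOUR_CUSTOM_MODEL_ID",
        "outputMode": "json",
        "text": "The application runs on WebSphere Application Server on Linux.",
    }

    response = requests.post(ALCHEMY_ENDPOINT, data=params)
    response.raise_for_status()
    print(response.json())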

Results of my work

I spent about 4 hours on the training process. That is not too bad: it was the first time I used Watson Knowledge Studio, the model I wanted was not clear to me, and I had to annotate about 35 medium-length documents.
That number of documents is probably too small to create a strong, generic model. But I just needed it for the challenge, and I'm really happy with the result.

I used the Alchemy Language service to analyze a new document and it extracted 4 relations: three of them were correct and reported with a very high confidence score. One was incorrect, but it came with a very low confidence score; so it was wrong, but Watson somehow realized it.
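In practice that low score is easy to act on: the extracted relations can simply be filtered by a confidence threshold. A small sketch, assuming each relation in the JSON response carries its type, arguments and a score (the exact field names may differ, so adapt them to the actual response):

    # Sketch of post-processing the extracted relations, assuming each relation
    # in the JSON response carries a "type", its "arguments" and a "score".
    # The field names are assumptions; adapt them to the actual response.
    MIN_SCORE = 0.5

    def keep_confident_relations(result, min_score=MIN_SCORE):
        relations = result.get("typedRelations", result.get("relations", []))
        return [r for r in relations if float(r.get("score", 0.0)) >= min_score]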

The results are very good for the needs of my challenge, and I really enjoyed the time spent on this training.
