Data annotation is important in machine learning because, in many cases, labeled training data makes the learning task much easier for the program.
The reason lies in the difference between supervised and unsupervised machine learning. With supervised machine learning, the training data is already labeled, so the system has examples of the desired results. For example, if the purpose of the program is to identify cats in images, the system is given a large number of photos tagged as "cat" or "not cat." It then compares new data against those examples to produce its results.
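As a rough illustration of how labeled examples drive a supervised prediction, here is a minimal sketch of a 1-nearest-neighbor classifier. The two numeric features and their values are invented for illustration only; a real image system would derive its features from the photos themselves and would typically use a far more capable model.

```python
import math

# Hypothetical labeled training data: (feature vector, label).
# The two features are made-up numeric scores, not real image measurements.
LABELED_EXAMPLES = [
    ((0.9, 0.8), "cat"),
    ((0.8, 0.9), "cat"),
    ((0.2, 0.1), "not cat"),
    ((0.1, 0.3), "not cat"),
]


def predict(features, examples=LABELED_EXAMPLES):
    """Return the label of the closest labeled example (1-nearest neighbor)."""
    nearest = min(examples, key=lambda ex: math.dist(features, ex[0]))
    return nearest[1]
```

Calling `predict((0.85, 0.75))` compares the new point with every tagged example and returns the label of the closest one. Without the labels there would be nothing to return.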
With unsupervised machine learning, there are no labels, so the system has to rely on attributes of the data and other techniques to identify the cats. Engineers can build the program around visual features of cats such as whiskers or tails, but the process is rarely as straightforward as it would be in supervised machine learning, where the labels tell the system directly what the correct answer is.
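For contrast, the sketch below groups unlabeled points with a bare-bones k-means loop, one common unsupervised technique (the text above does not name a specific algorithm, so this choice is an assumption). The feature values are again invented. Note that the output is only a cluster number per point: nothing in the result says which cluster means "cat."

```python
import math


def k_means(points, k=2, iterations=10):
    """Group unlabeled points into k clusters; returns one cluster index per point."""
    # Deterministic start for the sketch: the first k points act as initial centroids.
    centroids = list(points[:k])
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Move each centroid to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return assignments


# Hypothetical, unlabeled feature vectors (same made-up scale as a labeled set would use).
UNLABELED_POINTS = [(0.9, 0.8), (0.2, 0.1), (0.8, 0.9), (0.1, 0.3)]
clusters = k_means(UNLABELED_POINTS)
```

The algorithm can separate the points into two groups, but a person still has to inspect the groups to decide what each one represents.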
Data annotation is the process of affixing labels to training data sets. Labels can be applied in many different ways. The example above is binary data annotation (cat or not cat), but other kinds of data annotation are important as well. For example, in the medical field, data annotation may involve tagging specific biological images with tags identifying pathology, disease markers or other medical properties.
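One way to picture the result of annotation is as a record that pairs each raw item with its label. The sketch below is a hypothetical data structure, not a standard format; the class name, field names, file names and tag values are all placeholders chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """One annotated training item (hypothetical structure)."""
    item_id: str                               # e.g. a file name
    label: str                                 # e.g. "cat" or "not cat"
    tags: list = field(default_factory=list)   # optional finer-grained tags


def attach_labels(item_ids, labels):
    """Pair each raw item with its human-supplied label."""
    if len(item_ids) != len(labels):
        raise ValueError("every item needs exactly one label")
    return [Annotation(item, label) for item, label in zip(item_ids, labels)]


# Binary annotation: each photo gets one of two labels.
binary = attach_labels(["photo_1.jpg", "photo_2.jpg"], ["cat", "not cat"])

# Richer annotation: a single image can carry several descriptive tags.
# The tag names here are placeholders, not real medical terminology.
scan = Annotation("scan_001.png", "abnormal", tags=["marker_a", "marker_b"])
```

The binary case needs only the `label` field, while the medical-style case uses `tags` to carry several properties for one image.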
Data annotation takes work, and is often done by teams of people, but it is a fundamental part of what makes many machine learning projects function accurately. It provides the initial setup for teaching a program what it needs to learn and how to discriminate between various inputs to come up with accurate outputs.