What Is The Data Labeling Process?

Data labeling, or data annotation, generally refers to the process of preparing ground truth datasets used to train or validate machine learning models. Today, companies use a combination of labeling tools, data pipelines, and human annotators to label their data.

Steps Involved In The Data Labeling Process

Data Preparation For Labeling

Define Problem Statement

  • Labeling type required (e.g. semantic segmentation).
  • Objects/features to be labeled (e.g. cars).
  • Scenarios to be considered (e.g. snowy or night scenes).
  • Amount of data to be labeled.

Collect Data

  • Collect data for scenarios and features of interest.

Select Data

Setting Up Labeling Tasks

Define Labeling Guidelines

  • Create simple and objective labeling instructions for annotators.

Choose Annotation Tool

  • Supports the required annotation type.
  • Integrates easily with the data pipeline.
  • Is easy for annotators to use.

Configure Annotation Tool

  • Set up classes.
  • Set up attributes.
  • Configure the tool to visualize raw data (e.g. images, 3D point clouds).
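Concretely, a tool configuration of classes and attributes might look like the following sketch. The schema, class names, and attributes here are illustrative assumptions, not tied to any particular annotation tool:

```python
import json

# Hypothetical annotation-project configuration: classes the annotators
# can assign, per-class attributes, and the raw data types to visualize.
label_config = {
    "classes": [
        {"name": "car", "color": "#FF0000"},
        {"name": "pedestrian", "color": "#00FF00"},
    ],
    "attributes": {
        "car": ["occluded", "truncated"],
        "pedestrian": ["occluded"],
    },
    "data_sources": [
        {"type": "image", "format": "jpg"},
        {"type": "point_cloud", "format": "pcd"},
    ],
}

print(json.dumps(label_config, indent=2))
```

Keeping this configuration in version control alongside the labeling guidelines makes it easy to audit which class set produced which dataset.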

Hire Skilled Workforce

Train Workforce

  • Prepare training material.
  • Train and allocate based on proficiency.

Labeling Process

Make Annotations

Check Annotations

  • Manual and automated checks.
  • If mistakes are detected in labeling, annotations are sent back for correction.

Sampling for QC

Data Transformation (for model ingestion)

Train Model

Test Model

Project Management

Edge Cases Handling

  • If edge cases are found during labeling, the policies and guidelines are updated to cover such cases.

Workforce And Task Manager

  • Track workforce productivity and accuracy.
  • Monitor project progress and resolve blockers.

5 Main Components of Data Labeling

After working with 100+ enterprises and tech companies, we've learned the best possible ways to combine technology and humans to optimize data labeling operations. Here are five essential elements to consider when you need to label data for creating ground truth datasets:

1. Annotation Tools

What a paintbrush is to an artist, annotation tools are to any data labeling process. They are an essential prerequisite. Annotations help give meaning to raw, unstructured data. You need to either build these tools yourself or buy them from a third party. Depending on where you are in your data labeling journey, you might have different requirements.

How to choose a data labeling tool?

a) Annotation types

Based on the problem statement you're working on, you will know whether you want to detect objects in an image, track the joints of a human, or identify the sentiment of a tweet. There are specialized tools available for various annotation types. Preferably, choose a tool that supports your requirements out of the box.

b) Ease of integration

As your data labeling process grows, you will develop integrations with annotation tools in various parts of the pipeline. Choosing a tool that uses modern web technologies and provides clean integrations will make your life much easier. It might sound obvious, but we've seen hard drives shipped across the Atlantic just to get data labeled (no judgment).

c) Ease of use for annotators

Annotators spend a huge part of their working day in annotation tools, so the user experience of those tools can't be emphasized enough. A tool where making an annotation is a quick and snappy process helps you build your datasets in less time and with fewer people.

Levels of Data Labeling Automation

Level 0: No Automation

The annotator manually performs all the labeling tasks.

Level 1: Tool Assistance

Human annotators are assisted by features such as superpixel annotation, which labels groups of pixels at a time, and interpolation models, which speed up labeling without intensive manual effort on every single frame or annotation.
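As a sketch of the interpolation idea, a tool can fill in bounding boxes between two human-annotated keyframes by linear interpolation. The box format `(x, y, w, h)` and the frame numbers are assumptions for illustration:

```python
def interpolate_box(box_a, box_b, t):
    """Linearly interpolate two boxes (x, y, w, h) at fraction t in [0, 1]."""
    return tuple(a + (b - a) * t for a, b in zip(box_a, box_b))

# Keyframes annotated by a human at frame 0 and frame 10:
# a car moving to the right at constant speed.
key_start = (100.0, 50.0, 40.0, 80.0)
key_end = (140.0, 50.0, 40.0, 80.0)

# The tool fills in frames 1-9 automatically.
intermediate = [interpolate_box(key_start, key_end, f / 10) for f in range(1, 10)]
print(intermediate[4])  # frame 5 -> (120.0, 50.0, 40.0, 80.0)
```

Real tools use more sophisticated motion models, but even linear interpolation means the human annotates 2 frames instead of 11.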

Level 2: Partly Automated Labeling

Human annotators are assisted by machine learning models that auto-detect objects of interest; the human only verifies or edits the AI-detected objects.

Level 3: Highly Automated Labeling

The majority of the data is pre-labeled. The annotators must still be able to perform QC and suggest edits.

2. Quality Assurance

Label quality is the most critical piece of data labeling. It is a function of the annotation tools, ambiguity in the annotation guidelines, the expertise of the workforce, and the quality assurance workflows, and it sometimes depends on the type or diversity of the data itself. Optimizing for the highest quality with the available resources is a continuous effort, much like running a high-grade assembly line.

How is quality measured in data labeling?

The first step in assuring quality is measuring it. If you think of data labeling as an assembly line process, quality needs to be measured at each step in the assembly line. This can be done using various methods:

a) Test questions

The annotator's output can be compared against a curated set of test questions that have known-correct annotations. For most geometric annotations required in computer vision, any two annotations can be compared using an IoU (intersection-over-union) score.
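A minimal IoU implementation for axis-aligned boxes might look like this; the `(x1, y1, x2, y2)` box format is an assumption:

```python
def iou(box_a, box_b):
    """Intersection-over-union for axis-aligned boxes (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle (if any).
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

# Annotator's box vs. the gold-standard test question.
score = iou((0, 0, 10, 10), (5, 0, 15, 10))
print(round(score, 3))  # 0.333 -- overlap 50, union 150
```

An IoU of 1.0 means a perfect match; a project-specific threshold (often around 0.5 to 0.9) decides whether the annotator's answer counts as correct.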

b) Heuristic checks

Statistical analysis of annotations can flag outliers that creep in due to human misjudgment. For example, when labeling pedestrians in a point cloud, you can't have a pedestrian who is 10 feet tall.
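A heuristic check of this kind can be as simple as a range filter; the thresholds and annotation records below are illustrative assumptions:

```python
# Sanity check: flag pedestrian cuboids whose height falls outside a
# plausible human range (values in meters; thresholds are assumptions).
MIN_HEIGHT_M, MAX_HEIGHT_M = 0.5, 2.5

annotations = [
    {"id": 1, "class": "pedestrian", "height": 1.7},
    {"id": 2, "class": "pedestrian", "height": 3.1},  # ~10 ft -- implausible
    {"id": 3, "class": "pedestrian", "height": 1.2},
]

flagged = [
    a for a in annotations
    if a["class"] == "pedestrian"
    and not (MIN_HEIGHT_M <= a["height"] <= MAX_HEIGHT_M)
]
print([a["id"] for a in flagged])  # [2]
```

Flagged annotations are routed back to a reviewer rather than rejected automatically, since some outliers are legitimate.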

c) Sample quality check (QC)

A subset of annotations is sampled and carefully reviewed by expert annotators. Based on the number of true positives, false positives, and false negatives identified by the reviewers, metrics like precision and recall can be calculated.
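Given reviewer counts of true positives, false positives, and false negatives, the metrics can be computed directly; the counts below are made up for illustration:

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from reviewer-identified counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Reviewers sampled a batch and found 90 correct boxes,
# 5 spurious boxes, and 10 missed objects.
p, r = precision_recall(tp=90, fp=5, fn=10)
print(f"precision={p:.3f} recall={r:.3f}")  # precision=0.947 recall=0.900
```

High precision with low recall suggests annotators are missing objects; the reverse suggests they are over-labeling. Each pattern calls for different feedback.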
Measuring quality will help you find out whether the labeled data is good enough to be fed into your models. It will also help you catch the types of errors that are commonly being made, so that you can provide relevant feedback to your annotators. This feedback loop trains the annotators in a way that's not so different from the models you are training.

3. Workforce Model

As your model consumes more data and its accuracy improves, you'll discover new edge cases, and your model will have to learn new features. The need for labeled data can grow quickly, and managing such a process yourself can soon become overwhelming.

How do I know when it's time to scale and hire a data labeling service?

If your most expensive resources, such as data scientists and engineers, are spending 60-70% of their time on training datasets, you're ready to consider scaling with a data labeling service. Increases in data labeling volume, whether they happen over weeks or months, become increasingly difficult to manage in-house. They also drain the time and focus of some of your most expensive human resources: data scientists and machine learning engineers. If your data scientist is labeling or wrangling data, you're paying up to $90 an hour for it. It's better to free up such a high-value resource for more strategic and analytical work that will extract business value from your data.

(Figure: Percentage of time allocated to tasks in a machine learning project)

5 Steps To Scaling Data Labeling Functions

Most of the steps involved in data labeling drain the time and focus of your most expensive human resources: data scientists and machine learning engineers. It's better to free them up to focus on what they are good at: science and engineering. Here's how you can scale your data labeling functions more effectively:

1. Workforce Strength

Determine the required workforce strength based on the data volume.

2. Workforce Elasticity

Depending on the frequency of labeling, space out and allocate the workforce.

3. Workforce Hiring

Bring annotators on board once points 1 and 2 are understood.

4. Workforce Quality

Measure annotator productivity in terms of speed and accuracy. Being data-driven here can greatly optimize your operations.

5. Enabling Feedback Mechanisms

Streamline the feedback and review processes with the data labeling teams.
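For step 1, a back-of-the-envelope headcount estimate can be sketched as follows; every parameter (throughput, weekly hours, utilization) is an assumption to replace with your own measurements:

```python
import math

def annotators_needed(items_per_week, minutes_per_item,
                      hours_per_annotator_per_week=35, utilization=0.8):
    """Rough headcount estimate; default parameters are assumptions."""
    labeling_hours = items_per_week * minutes_per_item / 60
    # Annotators are never 100% utilized: training, breaks, rework.
    effective_hours = hours_per_annotator_per_week * utilization
    return math.ceil(labeling_hours / effective_hours)

# 10,000 images/week at ~3 minutes each -> 500 labeling hours.
print(annotators_needed(10_000, 3))  # 18
```

Running the same estimate for the peak and trough of your expected data volume gives a first read on workforce elasticity (step 2) as well.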

4. Pricing

Budgeting for labeling projects can get complex because of high variance: no two projects are very similar. Slight variations in data type, annotation type, number of classes, quality parameters, speed or volume of data, and ease of automation can influence pricing drastically.

4 Critical Price Considerations For Data Labeling

1. Project Duration

Is it a one-time project or a long-term recurring project?

2. Quality, Turnaround Time, Cost

Rank these in order of importance for your project, as at least one might have to be compromised.

3. Pricing Model

Evaluate whether paying per hour or paying per annotation works better for you.

4. Internal Costs

Which parts of the data labeling process do you want to carry out internally, and which require outsourcing? Anything you do internally also incurs significant costs.
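For the pricing-model comparison (point 3), a quick sketch with purely illustrative rates shows how the two models can diverge on the same workload:

```python
def cost_per_hour(hours, hourly_rate):
    """Total cost when billed by annotator time."""
    return hours * hourly_rate

def cost_per_annotation(num_annotations, rate_per_annotation):
    """Total cost when billed per labeled object."""
    return num_annotations * rate_per_annotation

# Illustrative numbers only: 50,000 boxes at an assumed 60 boxes/hour.
hours = 50_000 / 60
hourly = cost_per_hour(hours, hourly_rate=8.0)       # ~$6,667
per_label = cost_per_annotation(50_000, 0.11)        # ~$5,500

print(f"per-hour: ${hourly:,.0f}, per-annotation: ${per_label:,.0f}")
```

Per-annotation pricing shifts the throughput risk to the vendor, while per-hour pricing tends to win when images are sparse and annotation counts are low. Rerun the comparison with your own measured throughput.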

5. Security - How Will My Data Be Protected?

If data security is a factor in your machine learning process, your data labeling service must have a facility where the work can be done securely, along with the right training, policies, and processes in place.

3 Aspects of Data Security

Most importantly, your data labeling service must respect your data the way you and your organization do. They should also have a documented data security approach in all three of these areas:

1. People and Workforce

This could include background checks for annotators, who may be required to sign a non-disclosure agreement (NDA) or a similar document outlining your data security requirements. The workforce could be managed or measured for compliance. It may also include annotator training on security protocols related to your data.

2. Technology and Network

Annotators may be required to turn off devices they bring into the workplace, such as a mobile phone or tablet. Download or storage features may be disabled on devices annotators use to label data. There's likely to be significantly enhanced network security.

3. Facilities and Workspace

Annotators may sit in a space that blocks others from viewing their work. They may work in a secure location, with badged access that allows only authorized personnel to enter the building or room where data is being labeled. Video monitoring may be used to enhance physical security for the building and the room where work is done.

GDPR Compliance Requirements

To comply with GDPR, data collected in the EU can be sent outside the EU only if all personally identifiable information is removed. For visual data, this can be done by blurring out identifiers such as faces and vehicle number plates. Thankfully, this isn't as tedious as it sounds, since solutions exist that can automatically anonymize such data.
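As a toy sketch of the redaction idea (real pipelines run a face or plate detector and blur the regions rather than zeroing pixels; the image and box here are invented):

```python
def redact(image, boxes):
    """Black out PII regions in a grayscale image (nested list of pixels).

    boxes: (x1, y1, x2, y2) regions to overwrite, exclusive upper bounds.
    """
    for x1, y1, x2, y2 in boxes:
        for y in range(y1, y2):
            for x in range(x1, x2):
                image[y][x] = 0
    return image

img = [[255] * 6 for _ in range(4)]   # 6x4 all-white image
redact(img, [(1, 1, 3, 3)])           # hypothetical detector found a face here
print(img[1])  # [255, 0, 0, 255, 255, 255]
```

The key property is that anonymization happens before the data leaves the EU; the labeling workforce only ever sees the redacted copies.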

Security concerns shouldn't stop you from using a data labeling service that will free up you and your team to focus on the most innovative and strategic parts of machine learning: model training, tuning, and algorithm development.

Critical Questions You Should Ask Your Data Labeling Partner