Data Collection Process for a Machine Learning Algorithm - Lemberg Solutions

If you want to create an activity tracking application like iPhone Health or Google Fit, you probably know that such apps rely on data science to detect people’s motion throughout the day. Creating a brand-new app means developing algorithms from scratch, and this is the stage where many businesses struggle.

Therefore, we’ve decided to share our experience of launching HorseAnalytics, an application that uses data science algorithms to recognize and evaluate the activity of horses. 

So here is how your data collection process would look:

  1. Create a data collection plan.
  2. Organize a team to gather data.
  3. Organize tools to gather data.
  4. Always review the gathering process.
  5. Prepare a toolkit for data preprocessing.

If you are new to product development, read our article on How to Manage Your Remote Product Development Team.

Where do you begin?

The first thing you should do before developing any data science algorithm is to define your goals. In our case, these were the types of activities we wanted to track. In the HorseAnalytics project, we aimed to distinguish four basic training activities of a horse: standing, walking, trotting and galloping.

To teach an algorithm to recognize any activity, you need to give it the right data. Data is the key ingredient: it flows through a neural network until the network starts finding patterns and drawing conclusions based on similarities.

Keep in mind that only high-quality data allows you to build a correct model. But here’s the thing: when you’re developing a unique application, you most likely won’t find a structured database or, in some cases, any records at all. That is what we faced when developing a machine learning algorithm for HorseAnalytics, and that’s when you start collecting data yourself.

Create a data collection plan

Before you start collecting information, you need to develop a plan that describes which data you’ll need to collect, how much of it you need and who or what the subjects of data gathering are (in the case of HorseAnalytics, they were riders, horses, and their attributes). You should also understand the maximum and minimum amounts of data you need. The person responsible for all these requirements is your data scientist.
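
As a rough illustration, such a plan can be written down as a small data structure so that progress against it is easy to check. This is a minimal sketch, not the structure used in HorseAnalytics: the class name `CollectionPlan`, its fields and the minute thresholds are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CollectionPlan:
    """Hypothetical plan structure; all field names are illustrative."""
    activities: list[str]
    min_minutes_per_activity: float
    max_minutes_per_activity: float
    subjects: list[str] = field(default_factory=list)

    def status(self, recorded_minutes: dict[str, float]) -> dict[str, str]:
        """Report, per activity, how the recorded amount compares to the plan."""
        report = {}
        for activity in self.activities:
            minutes = recorded_minutes.get(activity, 0.0)
            if minutes < self.min_minutes_per_activity:
                report[activity] = "below minimum"
            elif minutes >= self.max_minutes_per_activity:
                report[activity] = "enough"
            else:
                report[activity] = "in range"
        return report


plan = CollectionPlan(
    activities=["standing", "walking", "trotting", "galloping"],
    min_minutes_per_activity=60,    # assumed value
    max_minutes_per_activity=300,   # assumed value
    subjects=["rider", "horse"],
)
progress = plan.status({"standing": 10, "walking": 120, "trotting": 300})
print(progress)
```

Keeping the plan in one place like this also makes it easier to revise the thresholds as the investigation moves on.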

With each new stage of the investigation, you’ll have a better understanding of what to include in the plan and what to exclude from it. You’ll see that some data only creates noise that has no value for analytics, while other data improves accuracy. So, it’s a good idea to regularly review and alter the plan according to your individual case.

Organize a team to gather data

When it comes to data, it’s very important to have a team of professionals you can trust: people who understand the importance of gathering the right data. They should know that deviating from the agreed flow corrupts the data gathered, so it is their responsibility to observe the data gathering flow and make notes about any issues that occur.

These data scientists should be able to work proactively, with minimum supervision, so that you can delegate the work later. Having a team that helps you scale and engage new people without your direct involvement is priceless.

The scheme below demonstrates how we set up the team for gathering data in the HorseAnalytics project.


Organize tools to gather data

You need specific hardware and software tools to gather data, and these tools depend on the project you have. In our case, the hardware was a mobile phone and a special HorseAnalytics blanket that had to be put under the saddle. We placed the phone inside a pocket on the blanket and let the horse do its routine.

It’s worth remembering that each piece of hardware gathers data in a different way. For example, you might notice some deviation when you compare data from two different device models, since they might be equipped with different sensors. We used a few smartphones in the gathering process to account for this and achieve better data accuracy.

It’s important to be consistent when you’re collecting data. We tried to stay as accurate as we could by always placing the device the same way in the same pocket, so that every gathering session followed the same approach to data recording.

Before each data gathering session, we made sure the battery was full, there was enough free space, the previously gathered data had been downloaded and the required software was working correctly. We also had hands-free devices and a fully charged power bank with us. And in case something went wrong with our main piece of hardware, the mobile phone, we had a spare one.

As far as software is concerned, we used mobile phones that ran on Android and iOS. We checked the software regularly to verify that we had only raw data, that no data processing took place and that the sensors generated data within the correct range.
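
A range check of this kind can be automated. The sketch below is only an illustration: it assumes each raw sample is an `(x, y, z)` accelerometer reading in m/s², and the limit `ASSUMED_LIMIT` is a hypothetical value that you would replace with the range from your own sensor’s specification.

```python
def out_of_range(samples, limit):
    """Return indices of (x, y, z) samples with any axis outside [-limit, limit]."""
    return [
        i for i, sample in enumerate(samples)
        if any(abs(value) > limit for value in sample)
    ]


# Hypothetical limit, roughly 8 g; check the specification of your sensor.
ASSUMED_LIMIT = 78.5  # m/s^2

samples = [
    (0.1, 9.8, 0.3),     # plausible reading
    (0.2, 9.7, 120.0),   # z axis far outside the assumed range
    (-80.0, 1.0, 2.0),   # x axis outside the assumed range
]
bad = out_of_range(samples, ASSUMED_LIMIT)
print(bad)
```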

Always review the gathering process

While collecting data is an automatic activity, its values depend on human factors. Make sure that all members of your data science team are on the same page. You also have to manage other people involved in the process: the riders, in our case. It was crucial for us to check where they placed the device, how they launched the application and whether they followed the plan.

The two best hacks we’ve developed in the process were inviting a trainer who could give us some feedback on the training quality and recording the training on camera. The latter can really help you out when you don’t understand why data from a particular rider differs so much from the rest.

Be prepared for low efficiency during the first iterations

In the beginning, everything happens very slowly, and that’s okay since the process is new to everybody. You might spend some time checking every single step of data gathering. For example, our riders would sometimes forget to launch or stop the app, or fail to follow instructions during the actual training. On top of that, they needed some time to get accustomed to the fact that someone other than their trainer was talking to them during the training. And don’t forget that horses are pretty unpredictable: when they got bored or distracted, riders ended up performing completely different tasks than we had planned.

We asked trainers to prepare programs that every rider should complete, so they also needed time to give every program a try and get familiar with the software that gathers data. We had short Q&A sessions after the training that helped the riders clear everything up, and a series of test rides to try out the software and the hardware they would be using.

But don’t worry, once you’re past the initial “trial-and-error” stage, the process of collecting data will go much quicker and smoother.

Always review the gathered data

Putting tons of time and effort into gathering data just to later find out that it’s spoiled is a nightmare for every data scientist. Anything can go wrong: some sensors may stop working or not work at all, while others may cause anomalies. That’s why you should always review the data you receive and try to spot any issues early so that you can fix them right away.

There were a couple of them in our case. Transitions between riding types (e.g. from standing to walking or from walking to trotting) produced noisy data. So, to remove it from a recorded session (e.g. one minute of walking), the horse had to stand still for five seconds before and after the session. Also, a chart of recorded “standing” didn’t look like standing at all, so we found the reasons why this kept happening and removed this session from the data pool.
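
One way to read the five-second rule is that the standing margins at both ends of a recording can simply be cut off together with the transition noise. The following is a minimal sketch under that assumption; the function name `trim_edges` and the sample layout `(timestamp_seconds, x, y, z)` are hypothetical.

```python
def trim_edges(samples, margin_s=5.0):
    """Drop samples within margin_s seconds of the start or end of a session.

    Each sample is (timestamp_seconds, x, y, z); samples are assumed
    to be sorted by timestamp.
    """
    if not samples:
        return []
    start = samples[0][0] + margin_s
    end = samples[-1][0] - margin_s
    return [s for s in samples if start <= s[0] <= end]


# A 70-second recording sampled once per second (illustrative data only).
session = [(float(t), 0.0, 0.0, 9.8) for t in range(71)]
trimmed = trim_edges(session)
print(len(session), len(trimmed))
```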

Prepare a toolkit for data preprocessing

If you want immediate feedback on issues with the data gathering toolkit, preprocessing is a must. To see whether the data you’ve collected is correct, you can count the corrupted vectors (null, null, null) and the duplicate vectors (repeated data), or run the data through a beta version of the network and check how well it recognizes the activities.
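
Counting corrupted and duplicate vectors takes only a few lines. The sketch below assumes that each vector is an `(x, y, z)` tuple, that a corrupted vector contains `None`, and that a “duplicate” means a valid vector identical to the one recorded immediately before it. These definitions are assumptions for illustration and may differ from the ones you need.

```python
def is_valid(vector):
    """A vector is valid when it exists and no component is missing."""
    return vector is not None and all(c is not None for c in vector)


def quality_report(vectors):
    """Count corrupted vectors and consecutive duplicates in a recording."""
    corrupted = sum(1 for v in vectors if not is_valid(v))
    duplicates = sum(
        1 for prev, cur in zip(vectors, vectors[1:])
        if is_valid(cur) and cur == prev
    )
    return {"total": len(vectors), "corrupted": corrupted, "duplicates": duplicates}


vectors = [
    (0.1, 9.8, 0.2),
    (0.1, 9.8, 0.2),       # repeats the previous reading
    (None, None, None),    # corrupted
    (0.3, 9.7, 0.1),
    (0.3, 9.7, 0.1),       # repeats the previous reading
    (0.3, 9.7, 0.1),       # repeats the previous reading
]
report = quality_report(vectors)
print(report)
```

A high share of either kind is a hint to look at the recording device before gathering more data.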

Here are a few tips you can use to analyze data for accuracy:

  • build graphs and visually compare data (axis directions, ranges) at the initial stage;
  • get a separate program to analyze data accuracy, one that can handle large volumes of information and, preferably, emulate a real workout. Be sure to check the number of duplicates, the number of corrupted vectors and whether the device has been placed correctly (axis direction + or -);
  • take photos and videos to see if the exercises were performed correctly;
  • invite experts (professional riders) at the data validation and analysis stages;
  • check that the required sensors are actually built into your piece of hardware (mobile phone).
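
The axis direction check from the tips above can be approximated by looking at where gravity shows up while the horse is standing. This is a sketch under assumptions: the samples are `(x, y, z)` accelerometer readings, and the expected axis and sign in `EXPECTED` are hypothetical values for one particular way of placing the phone.

```python
from statistics import mean


def gravity_axis(samples):
    """Return (axis_index, sign) of the axis with the largest mean magnitude."""
    means = [mean(s[i] for s in samples) for i in range(3)]
    axis = max(range(3), key=lambda i: abs(means[i]))
    return axis, (1 if means[axis] > 0 else -1)


# Hypothetical expectation: gravity on the y axis with a positive sign.
EXPECTED = (1, 1)

standing = [(0.1, -9.7, 0.3), (0.0, -9.8, 0.2), (-0.1, -9.9, 0.1)]
detected = gravity_axis(standing)
placed_correctly = detected == EXPECTED
print(detected, placed_correctly)
```

Here the sign is flipped compared to the assumed expectation, which would suggest the phone was put into the pocket upside down.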

The takeaways

It took us six months to gather and process the data needed to train a neural network to distinguish a horse standing, walking, trotting and galloping. And it turned out beautifully. 

And here’s a little TL;DR for you:

  1. The setup process of data gathering is time-consuming, so be sure to make some room for it while planning.
  2. Training people according to your needs will require patience. Lots of patience. And repetition.
  3. A regular audit of the data gathering process is a must.
  4. Review data frequently.
  5. Always review the data gathering plan and don’t be afraid to alter it.
  6. Pay close attention to the feedback from data scientists since it helps to keep data gathering tools and programs up to date.

We hope our tips on collecting the right data will help you use data science to propel your product or even your entire business. If you need a hand with that, we are here to help with gathering data, designing algorithms and training a neural network for your individual project. Contact us today!
