Machine Data Harvesting

Machine data comes in many forms. Temperature sensors, health trackers, and even air-conditioning systems deliver large volumes of information, but it is hard to know which of it matters. This article looks at some ways of making large machine-data sets usable with Hadoop.

Keeping and providing the data

Before examining the fundamental methods of storing the data, you should consider how, and for how long, the information will be kept.
One of Hadoop's defining characteristics is that it provides append-only storage for large volumes of data, which at first seems perfect for machine data. The problem appears just when the data goes live and needs to be useful: the sheer volume of accumulated information adds needless load to the environment.
Using Hadoop to store big data therefore needs careful management and a clear strategy. If you want to use the information for live alerts, you do not want to sift through years of records just to pick out the latest values. Choose deliberately what to store and for how long.
To know how much information you need to store, calculate the size of your records and how often they are written; from these figures you can estimate the volume of data created. For example, a three-field record is small, but saved every 15 seconds it creates about 45 KB of data per machine per day. Multiply that by 2,000 machines and you get roughly 92 MB per day.
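The arithmetic above can be sketched in a few lines. The record size and machine count here are the illustrative figures from the text, not measurements:

```python
# Rough storage estimate for machine data: record size x write frequency x hosts.
# RECORD_BYTES is an assumed on-disk size for a small three-field record.

RECORD_BYTES = 8          # assumption: ~8 bytes per three-field record
INTERVAL_SECONDS = 15     # one record every 15 seconds
HOSTS = 2000              # number of monitored machines

records_per_day = 24 * 60 * 60 // INTERVAL_SECONDS    # 5760 records per host per day
bytes_per_host_per_day = records_per_day * RECORD_BYTES
total_bytes_per_day = bytes_per_host_per_day * HOSTS

print(f"per host/day: {bytes_per_host_per_day / 1024:.0f} KB")        # ~45 KB
print(f"all hosts/day: {total_bytes_per_day / 1e6:.0f} MB (decimal)") # ~92 MB
```

Swapping in your own record sizes and intervals gives a quick first-order retention budget.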
Ask yourself how long the data needs to stay available. By-the-minute information is rarely consulted a week later, because its value fades once the problem has been resolved.
You should also define a baseline that fits the context. A baseline is a data point, or matrix of points, that represents normal operation. With a baseline available it is much easier to recognize aberrant trends or spikes: baselines are the comparison values you keep so you can identify when a new reading deviates from the normal level. Baselines come in three types:

  • Pre-existing baselines – the baseline is already known before you start monitoring the data.
  • Controlled baselines – for units and systems that need to be controlled; the baseline is determined by comparing the controlled value with the monitored one.
  • Historical baselines – applied to systems where the baseline is calculated from existing values.

Historical baselines shift over time and, apart from exceptional conditions, are never pinned to a hard figure. They should change with the values coming in from the sensor and be computed from past readings. You therefore have to decide which quantity you want to compare and how far back to go.
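A historical baseline of this kind can be sketched as a sliding window over past readings; the window size and sample values below are illustrative assumptions:

```python
from collections import deque
from statistics import mean

# Sketch of a historical baseline: "normal" is recomputed from a sliding
# window of past readings, so it drifts with the data rather than being
# pinned to a fixed figure. Window length is an assumption.

class HistoricalBaseline:
    def __init__(self, window=240):          # e.g. 240 samples = 1 hour at 15 s
        self.readings = deque(maxlen=window) # old readings fall off automatically

    def update(self, value):
        self.readings.append(value)

    def value(self):
        return mean(self.readings) if self.readings else None

baseline = HistoricalBaseline(window=4)
for temp in [20.0, 21.0, 22.0, 21.0]:
    baseline.update(temp)
print(baseline.value())   # 21.0
```

Choosing the window length is exactly the "how far back to go" decision the text describes.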
You may also keep the data to generate graphical representations, but, just as with basic storage, you are unlikely to go back to one specific moment in time. Keeping the minimum, maximum, and average for every 15-minute interval is enough to redraw the graph.
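That 15-minute rollup can be sketched as a simple bucketing pass; the timestamps (epoch seconds) and readings are illustrative:

```python
from statistics import mean

# Sketch: collapse raw 15-second readings into one (min, max, avg) triple per
# 15-minute bucket - enough to redraw a graph without keeping every point.

BUCKET = 15 * 60  # 15 minutes in seconds

def rollup(samples):
    """samples: iterable of (timestamp, value) -> {bucket_start: (min, max, avg)}"""
    buckets = {}
    for ts, value in samples:
        buckets.setdefault(ts - ts % BUCKET, []).append(value)
    return {start: (min(vs), max(vs), mean(vs)) for start, vs in buckets.items()}

samples = [(0, 20.0), (15, 22.0), (30, 21.0), (900, 25.0), (915, 23.0)]
print(rollup(samples))
# {0: (20.0, 22.0, 21.0), 900: (23.0, 25.0, 24.0)}
```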

Storing the data in Hadoop

Hadoop is not usually a good live database for big data. It is, however, a reasonable place to append information into the system, with a near-line SQL database as the preferred store for serving the data. A practical way to load information is a continuous write into the Hadoop Distributed File System (HDFS), appending to the current file; in this role Hadoop acts as a concentrator.

One technique is to record each distinct stream of information into its own file for a period of time and then copy that data into HDFS for processing. Alternatively, you can write straight into a file on HDFS that is accessible from Hive.

Within Hadoop, many small files are significantly less efficient and practical than a small number of bigger files: larger files are spread across the cluster more effectively. For the same reason, the information itself is best distributed across several nodes of the cluster.

Assembling the information from the various data points into a few bigger files is therefore more efficient.
Make sure the data is widely distributed within the system. With a 30-node cluster, you want the data split across the whole cluster, because that distribution gives the most efficient transactions and response times. This matters most if you plan to use the information for monitoring and alerts.
Those files can be fed through a single concentrator that gathers data from the various hosts and writes it into the bigger files. Separating the information in this way also means you can begin to partition your data systematically by host.
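A concentrator of this kind can be sketched in a few lines. A real deployment would append to files on HDFS; a local directory stands in here, and the host names and record format are illustrative assumptions:

```python
import os
import tempfile

# Sketch of a concentrator: records arriving from many hosts are appended to a
# small number of larger per-host files instead of one tiny file per reading.

def append_record(base_dir, host, record):
    # One larger append-only file per host keeps the file count low and lets
    # the data be partitioned by host later.
    path = os.path.join(base_dir, f"{host}.log")
    with open(path, "a") as f:
        f.write(record + "\n")

base = tempfile.mkdtemp()
for host, record in [("web01", "ts=0 temp=20"), ("web02", "ts=0 temp=24"),
                     ("web01", "ts=15 temp=21")]:
    append_record(base, host, record)

with open(os.path.join(base, "web01.log")) as f:
    print(f.read())
# ts=0 temp=20
# ts=15 temp=21
```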

Generating Baselines

First of all, if you do not have a hard or controlled baseline, you need to generate statistics that define what is normal. This picture will probably change over time, so you want to be able to establish what the baseline was over a given period by examining the available information. With Hive, you can run a suitable query over the data to produce minimum, maximum, and average figures.
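The query shape is the same MIN/MAX/AVG aggregate you would run in HiveQL; sqlite3 stands in for Hive here so the sketch is self-contained, and the table name and sample rows are illustrative assumptions:

```python
import sqlite3

# Sketch of the baseline aggregation. The aggregate syntax (MIN/MAX/AVG with a
# WHERE clause) carries over to HiveQL; sqlite3 is only a local stand-in.

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (host TEXT, ts INTEGER, temp REAL)")
conn.executemany("INSERT INTO readings VALUES (?, ?, ?)",
                 [("web01", 0, 20.0), ("web01", 15, 22.0), ("web01", 30, 21.0)])

row = conn.execute(
    "SELECT MIN(temp), MAX(temp), AVG(temp) FROM readings WHERE host = ?",
    ("web01",)).fetchone()
print(row)   # (20.0, 22.0, 21.0)
```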

To avoid re-scanning everything each time, write the results into a new table that incoming streams can be compared against when defining an issue. For continuous tests, compute the values you evaluate for both the latest and the current data: scan the whole table and compute the appropriate figure across the entire data set. You can also calculate additional values such as the standard deviation.
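Once the baseline and standard deviation are stored, the comparison against incoming values reduces to a threshold test. The history values and the three-sigma threshold below are illustrative assumptions:

```python
from statistics import mean, stdev

# Sketch: each incoming value is compared against baseline +/- k standard
# deviations; anything outside the band is a candidate for an alert.

history = [20.0, 21.0, 22.0, 21.0, 20.0, 21.0]  # assumed past readings
baseline = mean(history)
spread = stdev(history)
K = 3  # assumption: alert when more than 3 standard deviations out

def is_anomalous(value, baseline=baseline, spread=spread, k=K):
    return abs(value - baseline) > k * spread

print(is_anomalous(21.5))  # False: within normal variation
print(is_anomalous(40.0))  # True: well outside the baseline band
```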

An important step in generating baselines is archiving the old data: first compress it, then label it appropriately. This calls for a modified form of the baseline query that summarizes the data against the stored minimum and maximum basis.
Make sure you keep enough of the information that it can still be summarized to recreate the baseline values and identify the data that falls beyond them.


When you work with raw machine data, getting the information into Hadoop and keeping it there is actually the least of your troubles. The real work is deciding what the data represents and how you want to review and report on it. Once you have the raw data and can run queries on it in Hive, compute your baselines; then run a query that first materializes the baseline and then issues queries against it to locate the data beyond the baseline limits. This article has covered some methods for handling and defining that data, and finally for determining the live exceptions that should be reported, alerted on, and passed up to a control application.
