Hive is a data warehousing engine built on top of Hadoop that lets users work with data stored in the Hadoop Distributed File System (HDFS). It helps to process and analyze huge amounts of data, and it lets non-programmers query and manage large datasets in distributed Hadoop storage using SQL-like queries.
Previously, massive volumes of unstructured or semi-structured data were analyzed by writing MapReduce jobs directly. Because MapReduce is not user-friendly for newcomers or non-programmers, a better interface was needed, and Hive was introduced.
A little history of Apache Hive is worth knowing. Hive was first built at Facebook. Around 2006, Facebook was collecting data at a rate of tens of GB per day. Within a few years this grew to several TB per day. Initially, Python scripts were written to load the data into Oracle databases, but this became difficult as the data rate increased. Facebook needed a new kind of system that could handle large amounts of data and that people with SQL skills could use with minimal changes to the way they already worked with an RDBMS.
Important Features of Hive
- Databases and tables are created first, and data is then loaded into them.
- Hive supports Online Analytical Processing (OLAP). It also supports several file formats, including TEXTFILE, SEQUENCEFILE, ORC and RCFILE (Record Columnar File).
- HiveQL, Hive's SQL-like query language, covers both data definition statements and several kinds of queries.
- Hive can operate in two modes: local mode and MapReduce mode.
- Hive can partition a table's data into a directory-like layout, with one subdirectory per partition. A query that filters on the partition column reads only the matching directories, which improves query performance.
- It supports Python, Java and C++. Users can write their client applications in whichever of these languages they prefer.
- Queries are processed using the Hadoop MapReduce framework.
- Hive keeps its metadata in a relational database. The embedded Derby database is the single-user default, and MySQL is commonly used when multiple users need to share the metadata store.
- It has built-in functions for manipulating strings and dates and for data mining. Users can also add their own UDFs (User Defined Functions) for operations that the predefined functions do not cover.
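The partitioning feature in the list above can be pictured with a small sketch. Hive stores each partition of a table in its own subdirectory named `column=value`, so a filter on a partition column only has to touch the matching directories. The Python below illustrates that layout and the pruning idea; it is not Hive's implementation. The table name `sales`, the partition columns `country` and `year`, the warehouse path, and both helper functions are assumptions made for this example.

```python
from pathlib import PurePosixPath

# Assumed warehouse root for this illustration.
WAREHOUSE = PurePosixPath("/user/hive/warehouse")


def partition_path(table, partition):
    """Build the directory for one partition, e.g. .../sales/country=US."""
    path = WAREHOUSE / table
    for column, value in partition.items():
        path = path / f"{column}={value}"
    return path


def prune(paths, column, value):
    """Keep only directories that contain the column=value path segment."""
    wanted = f"{column}={value}"
    return [p for p in paths if wanted in p.parts]


# Hypothetical partitions of a hypothetical "sales" table.
partitions = [
    {"country": "US", "year": "2023"},
    {"country": "US", "year": "2024"},
    {"country": "IN", "year": "2024"},
]
paths = [partition_path("sales", p) for p in partitions]

# A filter such as WHERE country = 'US' only needs the two US directories.
us_paths = prune(paths, "country", "US")
for p in us_paths:
    print(p)
```

In this sketch the third directory is never visited for the `country = 'US'` filter, which is the reason partitioning on frequently filtered columns can reduce the amount of data a query reads.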
Uses and Advantages
Apache Hive offers several advantages over a traditional RDBMS when working with very large datasets.
- Hive is widely used for data warehousing tasks such as analysis of very large datasets, data encapsulation, data preparation, data mining, ETL and ad-hoc queries.
- Compared with earlier DBMSs, users can easily define their own functions in Hive.
- It is beginner friendly and anyone with basic knowledge of SQL can code in Hive.
- Queries can be easily converted for use with RHive or other Hadoop packages.
- By easing the querying, analysis and summarization of data, it helps in increasing workflow efficacy.
- It can be deployed in many environments, including the public cloud and the edge.
- It drastically reduces the cost and time required for the ETL process.
- Large corporations that already have established SQL expertise are now moving towards Hive.
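One way to plug custom logic into a query is Hive's TRANSFORM clause, which streams rows to an external script as tab-separated text and reads tab-separated rows back. The Python below is a minimal sketch of such a script under stated assumptions: the function names `mask_email` and `transform` and the two-column input (`user_id`, `email`) are hypothetical, and the sample rows stand in for the lines Hive would send on standard input.

```python
def mask_email(email):
    """Hide the local part of an address, keeping only its first character."""
    local, sep, domain = email.partition("@")
    if not sep or not local:
        return email  # not an address; pass it through unchanged
    return local[0] + "***@" + domain


def transform(lines):
    """Yield one tab-separated output row per tab-separated input row."""
    for line in lines:
        user_id, email = line.rstrip("\n").split("\t")
        yield f"{user_id}\t{mask_email(email)}"


# Sample rows for illustration; when run by Hive, iterate over sys.stdin instead.
sample = ["1\talice@example.com\n", "2\tbob@example.org\n"]
for row in transform(sample):
    print(row)
```

Saved under a hypothetical name such as `mask_email.py` and changed to read from `sys.stdin`, a script like this could be registered with `ADD FILE mask_email.py;` and called as `SELECT TRANSFORM(user_id, email) USING 'python mask_email.py' AS (user_id, email_masked) FROM users;`, where `users` is an assumed table.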
Hive was initially designed for scalability, extensibility and performance. Some limitations remain: early versions did not support row-level updates or deletes and had only limited subquery support. Later releases added UPDATE and DELETE for transactional (ACID) tables and broader subquery support, though with restrictions. Even so, its ease of coding and user-friendliness make it a much preferred tool for analyzing large datasets with SQL skills.