Lineland: Hive vs. Pig

Hive vs. Pig

While I was looking at Hive and Pig for processing large amounts of data without the need to write MapReduce code, I found that there is no easy way to compare them against each other without reading into both in greater detail.

In this post I am trying to give you a 10,000 ft view of both and compare some of the more prominent and interesting features. The following table, which is discussed below, compares what I deemed to be such features:

Feature	Hive	Pig
Language	SQL-like	PigLatin
Schemas/Types	Yes (explicit)	Yes (implicit)
Partitions	Yes	No
Server	Optional (Thrift)	No
User Defined Functions (UDF)	Yes (Java)	Yes (Java)
Custom Serializer/Deserializer	Yes	Yes
DFS Direct Access	Yes (implicit)	Yes (explicit)
Join/Order/Sort	Yes	Yes
Shell	Yes	Yes
Streaming	Yes	Yes
Web Interface	Yes	No
JDBC/ODBC	Yes (limited)	No

Let us now look into each of these in a bit more detail.

General Purpose

The question is "What does Hive or Pig solve?". Both have a very similar goal, which I think is lucky for us when it comes to comparing them. They try to ease the complexity of writing MapReduce jobs in a programming language like Java by giving the user a set of tools that they may be more familiar with (more on this below). The raw data is stored in Hadoop's HDFS and can be in any format, although natively it usually is a TAB separated text file, while internally they may also make use of Hadoop's SequenceFile file format. The idea is to be able to parse the raw data file, for example a web server log file, and use the contained information to slice and dice it into whatever the business needs. To that end they provide means to aggregate fields based on specific keys. In the end they both emit the result again in either text or a custom file format. Efforts are also underway to have both use other systems as a source for data, for example HBase.
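To make "aggregate fields based on specific keys" a bit more concrete, here is a minimal Python sketch of the kind of map and reduce logic that Hive and Pig are meant to spare you from writing by hand. It runs in memory only and does not use Hadoop at all. The three-column log layout (ip, url, status) and the function names are my own assumptions for illustration, not anything either project prescribes.

```python
from itertools import groupby
from operator import itemgetter


def map_phase(lines):
    """Emit (url, 1) for every well-formed TAB separated log line.

    Assumed (hypothetical) layout per line: ip <TAB> url <TAB> status
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3:  # silently skip malformed rows
            yield fields[1], 1


def reduce_phase(pairs):
    """Sum the counts per key, mimicking the sort-then-group step of MapReduce."""
    result = {}
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        result[key] = sum(count for _, count in group)
    return result


sample_log = [
    "10.0.0.1\t/index.html\t200\n",
    "10.0.0.2\t/index.html\t200\n",
    "10.0.0.1\t/about.html\t404\n",
]
hits_per_url = reduce_phase(map_phase(sample_log))
```

In either Hive or Pig the same result is a single grouping statement, which is the whole point of using them.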

The features I am comparing are chosen pretty much at random because they stood out when I read into each of these two frameworks. So keep in mind that this is a subjective list.

Language

Hive lends itself to SQL. But since we can only read already existing files in HDFS, it is lacking UPDATE or DELETE support, for example. It focuses primarily on the query part of SQL. But even there it has its own spin on things to better reflect the underlying MapReduce process. Overall it seems that someone familiar with SQL can very quickly learn Hive's version of it and get results fast.

Pig on the other hand looks more like a very simplistic scripting language. As with those (and this is a nearly religious topic), some are more intuitive and some are less. With PigLatin I was able to see what the samples do, but lacking full knowledge of its syntax I found myself wondering whether I would really be able to get what I needed without too many trial-and-error loops. Sure, Hive's SQL probably needs as many iterations to fully grasp, but there is at least a greater understanding of what to expect.

Schemas/Types

Hive once more uses a specific variation of SQL's Data Definition Language (DDL). It defines the "tables" beforehand and stores the schema in either a shared or a local database. Any JDBC offering will do, but it also comes with a built-in Derby instance to get you started quickly. If the database is local then only you can run specific Hive commands. If you share the database then others can also run these; otherwise they would have to set up their own local database copy. Types are also defined upfront, and supported types are INT, BIGINT, BOOLEAN, STRING and so on. There are also array types that let you handle specific fields in the raw data files as a group.

Pig has no such metadata database. Datatypes and schemas are defined within each script. Furthermore, types are usually determined automatically by their use. So if you use a field as an Integer it is handled that way by Pig. You do have the option though to override this and give explicit type definitions, again within the script that needs them. Pig has a similar set of types compared to Hive. For example it also has an array-like collection type called "bag".

Partitions

Hive has a notion of partitions. They are basically subdirectories in HDFS. They allow, for example, processing a subset of the data by alphabet or date. It is up to the user to create these "partitions" as they are neither enforced nor required.

Pig does not seem to have such a feature. It may be that filters can achieve the same but it is not immediately obvious to me.
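Since Hive's partitions are basically subdirectories, the benefit can be sketched in a few lines of Python: if the partition value is part of the path, a query for one date only has to touch one directory. Hive lays partitions out as key=value directory names; the table location /warehouse/weblogs, the key name dt and both helper functions below are made up for illustration and are not Hive APIs.

```python
def partition_path(table_dir, **keys):
    """Build a key=value style partition directory below a table directory."""
    parts = [f"{name}={value}" for name, value in keys.items()]
    return "/".join([table_dir.rstrip("/")] + parts)


def prune(paths, **wanted):
    """Keep only the paths whose key=value components match every wanted key."""
    selected = []
    for path in paths:
        components = dict(c.split("=", 1) for c in path.split("/") if "=" in c)
        if all(components.get(key) == value for key, value in wanted.items()):
            selected.append(path)
    return selected


# Hypothetical table with one partition per day.
days = ("2009-06-01", "2009-06-02", "2009-06-03")
all_paths = [partition_path("/warehouse/weblogs", dt=day) for day in days]

# A query restricted to a single day only needs to read this one directory.
selected = prune(all_paths, dt="2009-06-02")
```

A filter in Pig would presumably give the same rows, but it would have to read all the data first, which is exactly what the directory layout avoids.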

Server

Hive can start an optional server, which is allegedly Thrift based. With the server I presume you can send queries from anywhere to the Hive server, which in turn executes them.

Pig does not seem to have such a facility yet.

User Defined Functions

Hive and Pig allow for user functionality by supplying Java code to the query process. These functions can add any additional feature that is needed to crunch the numbers.

Custom Serializer/Deserializer

Again, both Hive and Pig allow for custom Java classes that can read or write any file format required. I also assume that is how they will eventually connect to HBase (just a guess). You can write a parser for Apache log files or, for example, the binary Tokyo Tyrant Ulog format. The same goes for the output: write a database output class and you can write the results back into a database.

DFS Direct Access

Hive is smart about how to access the raw data. A "select * from table limit 10" for example does a direct read from the file. If the query is too complicated it will fall back to use a full MapReduce run to determine the outcome, just as expected.

With Pig I am not sure if it does the same to speed up simple PigLatin scripts. At least it does not seem to be mentioned anywhere as an important feature.

Join/Order/Sort

Hive and Pig have support for joining, ordering or sorting data dynamically. These serve pretty much the same purpose in both, allowing you to aggregate and sort the result as needed. Pig also has a COGROUP feature that allows you to do OUTER JOINs and so on. I think this is where you spend most of your time with either package, especially when you start out. But from a cursory look it seems both can do pretty much the same.
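How a co-grouping step leads to an outer join is easier to see in code. The following Python sketch is my own in-memory illustration of the idea and not Pig's implementation: first collect the values from both inputs under each key, then pair them up and pad with None wherever one side has nothing for that key. The sample data and function names are invented.

```python
from collections import defaultdict


def cogroup(left, right):
    """Group two lists of (key, value) pairs into key -> (left_values, right_values)."""
    groups = defaultdict(lambda: ([], []))
    for key, value in left:
        groups[key][0].append(value)
    for key, value in right:
        groups[key][1].append(value)
    return dict(groups)


def full_outer_join(left, right):
    """Derive a full outer join from the co-grouped data."""
    rows = []
    for key, (left_values, right_values) in sorted(cogroup(left, right).items()):
        # An empty side is padded with a single None so the key is not lost.
        for left_value in left_values or [None]:
            for right_value in right_values or [None]:
                rows.append((key, left_value, right_value))
    return rows


users = [(1, "alice"), (2, "bob")]
visits = [(2, "/index.html"), (3, "/about.html")]
joined = full_outer_join(users, visits)
```

Dropping the padding on one side or on both would turn this into a left, right or inner join, which is why one grouping primitive covers all of them.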

Shell

Both Hive and Pig have a shell that allows you to query specific things or run the actual queries. Pig also passes on DFS commands such as "cat" to allow you to quickly check what the outcome of a specific PigLatin script was.

Streaming

Once more, both frameworks seem to provide streaming interfaces so that you can process data with external tools or languages, such as Ruby or Python. How well streaming performs, and whether it affects the two differently, I do not know. This is for you to tell me :)
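To give an idea of what such an external tool could look like, here is a small Python sketch of a streaming-style filter. I am assuming the usual Hadoop-style convention that rows are handed to the script as TAB separated lines and read back the same way. The two-column layout (ip, url) and the function names are hypothetical.

```python
import io


def top_section(url):
    """Return the first path segment of a URL path, or "(root)" for "/"."""
    first = url.strip("/").split("/")[0]
    return first if first else "(root)"


def add_section(stream_in, stream_out):
    """Read TAB separated (ip, url) rows and write (ip, section) rows."""
    for line in stream_in:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip malformed rows
        ip, url = fields[0], fields[1]
        stream_out.write(f"{ip}\t{top_section(url)}\n")


# In-memory demonstration with file-like objects.
source = io.StringIO("10.0.0.1\t/blog/2009/post.html\n10.0.0.2\t/\n")
sink = io.StringIO()
add_section(source, sink)
output = sink.getvalue()
```

Used as a real streaming script, the same function would be called with sys.stdin and sys.stdout instead of the in-memory buffers.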

Web Interface

Only Hive has a web interface or UI that can be used to visualize the various schemas and issue queries. This is different from the above mentioned Server as it is an interactive web UI for a human operator. The Hive Server, in contrast, is meant to be used from another programming or scripting language.

JDBC/ODBC

Another Hive-only feature is the availability of a JDBC/ODBC driver, again with limited functionality. It is another way for programmers to use Hive without having to bother with its shell or web interface, or even the Hive Server. Since only a subset of features is available it will require small adjustments on the programmer's side of things, but otherwise it seems like a nice-to-have feature.

Conclusion

Well, it seems to me that both can help you achieve the same goals, while Hive comes more naturally to database developers and Pig to "script kiddies" (just kidding). Hive has more features as far as access choices are concerned. The two projects also reportedly have roughly the same number of committers and are both going strong development-wise.

This is it from me. If you have a different opinion or a comment on the above, please feel free to reply below. Over and out!
