Hive lets you use SQL on Hadoop, but tuning SQL on a distributed system is different. Here are 12 tips to help your effort fly
Hive is not an RDBMS, but it pretends to be one most of the time. It has tables, it runs SQL, and it supports both JDBC and ODBC.
The good and bad news of this revelation: Hive doesn't run queries the way an RDBMS does. It's a long story, but I spent an 80-plus-hour workweek personally tuning Hive. Needless to say, my head won't quit buzzing. So for your benefit, here are a few suggestions to make your next Hive project go a little bit faster than mine did.
1. Don't use MapReduce
Whether you believe in Tez, Spark, or Impala, don't believe in MapReduce. It is slow on its own, and it's really slow under Hive. If you're on the Hortonworks distribution, you can throw set hive.execution.engine=tez at the top of a script. On Cloudera, use Impala. Hopefully, soon you can set hive.execution.engine=spark when Impala isn't appropriate.
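For example, a Tez-enabled script header might look like the following sketch; the query and table name are just placeholders showing where your real work goes:

set hive.execution.engine=tez;
-- everything after this runs on Tez instead of MapReduce
select count(*) from my_table;  -- my_table is a made-up example table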
2. Don't do string matching in SQL
Ever! Especially in Hive. If you stick a LIKE string match where a join condition should be, you'll generate a cross-product warning. If you have a query that runs in seconds, with string matching it will take minutes. Your best alternative is to use one of the many tools that let you add search to Hadoop. Look at Elasticsearch's Hive integration or Lucidworks' integration for Solr. Also, there's Cloudera Search. RDBMSes were never good at this, but Hive is worse.
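For the record, this is roughly the anti-pattern to avoid; the tables and columns are invented, but the shape is what turns a fast query into a cross product:

-- avoid: a LIKE match doing the work a join key should do
select a.*
from orders a, customers b
where a.customer_name like concat('%', b.name, '%');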
3. Don't do a join on a subquery
You're better off creating a temporary table, then joining against that temp table, instead of asking Hive to be smart about how it handles subqueries. In other words, don't do this:
select a.* from something a inner join (select ... from somethingelse b union select ... from anotherthing c) d on a.key1 = d.key1 and a.key2 = d.key2 where a.condition=1
Instead, do this:
create table var_temp as select ... from somethingelse b union select ... from anotherthing c;

and then:

select a.* from something a inner join var_temp b on a.key1 = b.key1 and a.key2 = b.key2 where a.condition = 1
It really shouldn't be much faster at this point in Hive's evolution, but it generally is.
4. Use Parquet or ORC, but don't convert to them for sport
That is, use Parquet or ORC as opposed to, say, TEXTFILE. However, if you have text data coming in and are massaging it into something slightly more structured, do the conversion to the target tables. You can't <a href="https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-Loadingfilesintotables" rel="noopener nofollow" target="_blank">LOAD DATA</a> from a text file into an ORC table, so do the initial load into a text-format table.
When you create the other tables against which you'll ultimately run most of your analysis, do your ORCing there, because converting to ORC or Parquet takes time and isn't worth it as step one in your ETL process. If you have simple flat files coming in and aren't doing any tweaking, then you're stuck loading into a temporary table and doing a CREATE TABLE AS SELECT into an ORC or Parquet table. I don't envy you, because it is kind of slow.
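A minimal sketch of that flow, with made-up table and path names:

-- stage the raw text first; LOAD DATA can't target an ORC table
create table staging_raw (id int, payload string)
row format delimited fields terminated by '\t'
stored as textfile;

load data inpath '/landing/raw_data' into table staging_raw;

-- convert once, into the table you'll actually analyze
create table analysis_orc stored as orc as
select * from staging_raw;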
5. Try turning vectorization on and off
Add set hive.vectorized.execution.enabled = true; and set hive.vectorized.execution.reduce.enabled = true; to the top of your scripts. Try running your queries with them on and off, because vectorization seems problematic in recent versions of Hive.
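A script header for the experiment might look like this; run your queries once with both set to true and once with both set to false, and compare timings:

set hive.vectorized.execution.enabled = true;
set hive.vectorized.execution.reduce.enabled = true;
-- your queries follow; flip both to false on the second run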
6. Don't use structs in a join
I have to admit my native-brain SQL syntax is about SQL-92 era, so I don't tend to use structs anyhow. But if you're doing something super-repetitive like ON clauses for compound PKs, structs are handy. Unfortunately, Hive chokes on them, particularly in the ON clause. Of course, it doesn't do so on smaller data sets and yields no errors much of the time. In Tez, you get a fun vector error. This limitation isn't documented anywhere that I know of. Consider this a fun way to get to know the innards of your execution engine!
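Concretely, here is the handy-but-risky pattern versus the boring one, with invented tables and keys:

-- tempting on a compound PK, but this is where Hive tends to choke
select a.*
from fact a
join dim b
on struct(a.key1, a.key2) = struct(b.key1, b.key2);

-- the verbose version that behaves
select a.*
from fact a
join dim b
on a.key1 = b.key1
and a.key2 = b.key2;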
7. Check your container size
You may need to increase your container size for Impala or Tez. Also, the "recommended" sizes may not apply to your system if you have larger node sizes. You might want to make sure your YARN queue and general YARN memory settings are appropriate. You might also want to peg your jobs to a queue that isn't the default queue all the peasants use.
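On Tez, for instance, a per-session sketch might look like this; the size and queue name are placeholders, not recommendations:

set hive.tez.container.size=4096;  -- MB; match this to your node memory
set tez.queue.name=analytics;      -- a dedicated YARN queue instead of default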
8. Enable statistics
Hive does somewhat boneheaded things with joins unless statistics are enabled. You may also want to use query hints in Impala.
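On the Hive side, that generally means computing table and column stats and letting the planner see them. A sketch, with a placeholder table name:

set hive.stats.autogather=true;            -- gather stats as data is inserted
set hive.compute.query.using.stats=true;   -- answer simple aggregates from stats
set hive.stats.fetch.column.stats=true;    -- expose column stats to the planner

analyze table my_table compute statistics;
analyze table my_table compute statistics for columns;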
9. Consider MapJoin optimizations
If you do an explain on your query, you may find that recent versions of Hive are smart enough to apply the optimization automatically. But you may still need to tweak the settings.
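These are the usual knobs to try if explain shows the small-table join isn't being converted; the size threshold here is a placeholder in bytes:

set hive.auto.convert.join=true;
set hive.auto.convert.join.noconditionaltask=true;
set hive.auto.convert.join.noconditionaltask.size=268435456;  -- tune to your small tables

-- or the old-style hint on the small table (newer Hive may ignore hints by default)
select /*+ MAPJOIN(d) */ f.*
from big_fact f join small_dim d on f.key = d.key;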
10. If you can, put the largest table last
Period.
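In other words, with invented table names, if big_fact is your largest table:

-- the last table in the join is the one Hive streams, so put the giant there
select f.*
from small_dim d
join big_fact f
on d.key = f.key;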
11. Partitions are your friends ... sorta
If one value, such as a date (but ideally not a range) or a location, shows up over and over in your WHERE clauses, you might have your partition key! Partitioning basically means "split this into its own directory," so instead of scanning one big file, Hive scans only the files for the partition you asked for, because your join/where clause says you're looking only at location='NC', which is a small subset of your data. Also, unlike with column values, you can push partitions in your <a href="https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-Loadingfilesintotables" rel="noopener nofollow" target="_blank">LOAD DATA</a> statements. However, remember that HDFS does not love small files.
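A rough sketch of both halves, with invented names:

create table events (id bigint, payload string)
partitioned by (location string)
row format delimited fields terminated by '\t';

-- the partition goes right in the LOAD DATA statement
load data inpath '/landing/nc_events' into table events partition (location='NC');

-- this predicate prunes every other location's directory
select count(*) from events where location = 'NC';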
12. Use hashes for column comparisons
If you're comparing the same 10 fields in every query, consider using <a href="https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF" rel="noopener nofollow" target="_blank">hash()</a> and comparing the sums. These are sometimes so useful you might shove them in an output table. Note that the hash in Hive 0.12 is low resolution, but better hashes are available in 0.13.
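A sketch of the idea, with invented column names; hash() takes a variable number of arguments, so the repeated field list collapses into one comparison:

-- find rows whose repeated fields differ between the two tables
select a.id
from table_a a
join table_b b on a.id = b.id
where hash(a.f1, a.f2, a.f3) <> hash(b.f1, b.f2, b.f3);

-- or precompute it once and shove it in an output table
create table table_a_hashed stored as orc as
select id, hash(f1, f2, f3) as row_hash from table_a;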
There's my dirty dozen. I hope this means my endless late-night sufferings will not have been in vain, and that they help keep others from getting stung by Hive's eccentricities.


