Skip to main content
Lab - Working with Hive Tables
Deliverable: 6 Screenshot in one word document. I have mentioned where these screenshot are to be taken.

  1. In the cloudera Quickstart vm. Download the SalesData.csv from Blackboard.
  2. Open the Hue UI.  
  3. Click Hue menu and select Files → Go to /user/hive/warehouse/
  4. Click Upload button and select Files. Select the SalesData.csv file that you downloaded from Downloads folder.
  5. Click Hue menu and select Tables to create a new table
  6. We will create a table from a file. Click on the + button on top right side of the screen.
  7. We’ll use a file to create our table. Ensure that File is selected as Type.
  8. Select the path by selecting the ellipse button. Click Upload a file, and select the /user/hive/warehouse/SalesData.csv
  9. Click Next. Review Field Names and Field Types listed and click submit.
  10. Hive will create table using the file you uploaded. Take a look at the newly created table in Hive. Take screenshot 1 here.
  11. When uploading CSV file to create table it is important to ensure that there is no commas in field data. If there are commas in the data, the table columns in Hive will have wrong data. Custom SerDe (Serializer/Deserializer) code can be used to fix this problem.  
  12. In the cloudera Quickstart vm. Download the SalesData-WithCommas.csv and csv-serde-1.1.2-0.11.0-all.jar from Blackboard.
  13. Open Hive Menu → Files → Goto folder /user/hive/warehouse
  14. Click Upload button and select Files. Select the SalesData-WithCommas.csv file that you downloaded from Downloads folder.
  15. Click Hue menu and select Tables to create a new table
  16. We will create a table from a file. Click on the + button on top right side of the screen.
  17. We’ll use a file to create our table. Ensure that File is selected as Type.
  18. Select the path by selecting the ellipse button. Click Upload a file, and select the /user/hive/warehouse/SalesData-WithCommas.csv
  19. Click Next. Review Field Name and Field Type and click submit.
  20. Open the salesdatawithcommas table and check the company name you will notice “Netflix, inc.” has been separated into two columns. Sector and columns after Sector have wrong data. Take screenshot 2 here.
  21. We will fix this using custom serialization and deserialization. Open Hive Query editor and select all from the salesdatawithcommas table.  
  22. Lets alter the table using a custom serde code. Type the following query, do not execute it yet.
ALTER TABLE salesdatawithcommas SET SERDE ‘com.bizo.hive.serde.csv.CSVSerde’
  1. Click the session button (the one with gears icon) → Select files → click on ellipses and select the jar file. Close the window and hit the play button to run the query. Take screenshot 3 here.
  2. Now you should be able to notice that we have fixed our issue with commas in data field.
  3. The Hive performance can be improved by partitioning the table into multiple tables. Partitions are based on one or more partition column. Let’s create a table that is partitioned by year. Download YearlySales.zip file and extract it.
  4. Open Hive Editior and use the following script to create the table definitions in Hive. Take screenshot 4 here.
create table sales_all_years (RowID smallint, OrderID int, OrderDate date, OrderMonthYear date, Quantity int, Quote float, DiscountPct float, Rate float, SaleAmount float, CustomerName string, CompanyName string, Sector string, Industry string, City string, ZipCode string, State string, Region string, ProjectCompleteDate date, DaystoComplete int, ProductKey string, ProductCategory string, ProductSubCategory string, Consultant string, Manager string, HourlyWage float, RowCount int, WageMargin float)
partitioned by (yr int)
row format serde ‘com.bizo.hive.serde.csv.CSVSerde’
stored as textfile;
  1. Click Hue Menu → select Files → go to /user/hive/warehouse/sale_all-years
  2. Create 4 directories 2009, 2010, 2011 and 2012. Then, upload appropriate csv file into each directory. For example, add SalesData-2009.csv to 2009 folder.
  3. Open Hive editor and run the following scripts: After it executes, take screenshot 5 here.
-- add the partitions
alter table sales_all_years
add partition (yr=2009)
location '2009/';

alter table sales_all_years
add partition (yr=2010)
location '2010/';

alter table sales_all_years
add partition (yr=2011)
location '2011/';

alter table sales_all_years
add partition (yr=2012)
location '2012/';
  1. Open the sales_all_year table. You should be able to see the data that you uploaded. Take ScreenShot 6 of the data.
  2. Submit the screenshots in a word file.

Comments

Popular posts from this blog

CUMIPMT and CUMPRINC function

CUMIPMT Cumulative interest payment function allows you to calculate the interest paid for a loan or from an investment from period A to period B. When getting a loan, CUMIPMT function can be used to calculate the total amount of interest paid in the first five months or from period 12 to period 20. A period can be a month, a week or two week. Loan Amount : 350,000.00 APR: 4.5% Down payment: 0.00 Years: 25 Payment per year: 12 From the above data, we can calculate the following: No of Period: 25 × 12 = 300 Periodic Rate: 4.5/12 = 0.375% Here is how you will substitute these values into the function. = CUMIPMT (periodic rate, No of period, vehicle price, start period, end period,  ) = CUMIPMT (0.375, 300, 350000, 1, 5, 0) In an excel worksheet, we use cell address instead of actual values as shown below: Here is the formula view of the worksheet: CUMPRINC Another related function is CUMPRINC. CUMPRINC function is used to calculate cumul

Excel PMT Function

PMT function is very useful for calculating monthly payment required to payback a loan or mortgage at a fixed rate. This function require a minimum of three inputs, periodic rate, number of periods, present value or the loan amount. Here is a simple example. Home Loan: 350,000.00 Interest rate: 4.5% Number of years to repay the loan: 25 Note: To calculate monthly payment, we need to find the monthly rate and number of months as shown above. Then it is simply a matter of substituting the values into the payment function, as shown in the formula view below.

BCG's Brand Advocacy Index

The Boston Consulting Group's (BCG) Brand Advocacy Index (BAI) is a metric developed to help companies measure the degree of customer advocacy for their brands. BAI focuses on the likelihood of customers to recommend a brand to others, which is a powerful indicator of brand strength and customer loyalty. Unlike other customer satisfaction or loyalty metrics, BAI emphasizes the importance of customer referrals and word-of-mouth marketing. BAI is calculated based on a survey where customers are asked about their willingness to recommend a brand to their friends, family, or colleagues. The responses are then used to compute a score, which ranges from -100 to 100. A higher BAI score indicates that a brand has more advocates who are likely to recommend the brand to others, while a lower score suggests that the brand has fewer advocates or even a higher number of detractors. BCG's research has shown that companies with higher BAI scores tend to experience higher growth rates and bett