Importing Data into the Database
Using Aginity Workbench, you can import data from local files into your Hadoop database and choose whether to load the data into an existing table or create a new table for it.
Workbench supports import of source files in the following formats:
- Microsoft Office Excel 97-2003 (.xls) and Microsoft Office Excel 2007 (.xlsx)
- Comma Separated Values (CSV)
- Text File (.txt)
- Fixed Column Widths File
- JSON/XML File
Files in all the supported formats can be imported directly from an archive (zip, gzip, bzip2) without first extracting it to a potentially large intermediate file on disk. Multiple CSV files can be imported at a time.
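The idea of reading rows straight out of an archive, with no intermediate extracted file, can be sketched in Python. This is an illustration only, not Workbench's actual implementation; the file name and helper are made up:

```python
import csv
import gzip

def rows_from_gzip_csv(path):
    """Stream CSV rows directly out of a gzip archive,
    without extracting it to an intermediate file on disk."""
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as f:
        yield from csv.reader(f)

# Create a small gzipped CSV, then stream it back row by row.
with gzip.open("sample.csv.gz", mode="wt", encoding="utf-8", newline="") as f:
    f.write("id,name\n1,alpha\n2,beta\n")

for row in rows_from_gzip_csv("sample.csv.gz"):
    print(row)
```

Because the archive is decompressed as a stream, memory use stays small even when the uncompressed data is large.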
Import Data into a Destination Database
You can import data from a local source file into a new or existing Hadoop table.
To import data from a local file into a Hadoop table:
- On the Tools menu, point to Import and then click the command for the source file format (for example, from Excel).
- If you want to import an Excel file, browse to and select the file and then select the worksheet in the Excel file.
The Data Import Wizard opens.
If you are importing from an Excel file and connected to your Hadoop cluster using the Hive method, note that Hive doesn't support bulk loading of data from a client computer into a Hive table. The data is therefore first written to an intermediate .csv file and then loaded from that file into the destination table. Specify an HDFS location for the intermediate file in the Location box on the first page of the Data Import Wizard.
- If you are importing a JSON/XML file, select the nodes that will be turned into a flattened table during import.
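Flattening a nested node into table columns means turning each leaf value into a column, typically with a dotted path as the column name. A minimal Python sketch of that idea (illustrative only; Workbench's internal naming scheme may differ):

```python
import json

def flatten(node, prefix=""):
    """Flatten a nested JSON object into one row whose keys
    are dotted paths, the way selected nodes become columns
    of a flattened table."""
    row = {}
    for key, value in node.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            row.update(flatten(value, name + "."))
        else:
            row[name] = value
    return row

doc = json.loads('{"id": 1, "name": {"first": "Ada", "last": "Lovelace"}}')
print(flatten(doc))
# {'id': 1, 'name.first': 'Ada', 'name.last': 'Lovelace'}
```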
- Select a file encoding and a field delimiter (applicable to a CSV or text file only), or the column widths and file encoding (applicable to a fixed column widths file only).
Specify the range of data to import by setting the Start Column, Start Row, and End Row. If you don't want to import all the rows, clear the Import all rows check box and specify the End Row.
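For a fixed-column-widths file, the widths you specify determine where each field starts and ends on every line. A rough Python equivalent of slicing one line by a list of widths (the sample line and widths are made up):

```python
def split_fixed_width(line, widths):
    """Slice one line of a fixed-column-widths file into fields,
    stripping the padding spaces from each field."""
    fields, start = [], 0
    for width in widths:
        fields.append(line[start:start + width].strip())
        start += width
    return fields

print(split_fixed_width("001  Alice    NY", [5, 9, 2]))
# ['001', 'Alice', 'NY']
```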
During import of a CSV or text file, Workbench analyzes the data types in the file and makes any necessary changes: it adjusts column lengths to the maximum length allowed by each column's data type, or changes a column's data type. By default, Workbench analyzes a maximum of 1,000 rows. If the file you are importing contains more than 1,000 rows, the Data Import Wizard displays a message that only 1,000 rows were analyzed; you can select the Analyze more option later in the wizard to have Workbench analyze all the rows in the file.
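Sampling-based type inference of this kind can be sketched as follows. This is a simplified illustration under assumed widening rules (INT to DECIMAL to VARCHAR of the maximum observed length), not Workbench's actual algorithm:

```python
def infer_type(values):
    """Infer a column type from sampled values, widening as needed:
    INT -> DECIMAL -> VARCHAR(max observed length)."""
    kind = "INT"
    max_len = 0
    for v in values:
        max_len = max(max_len, len(v))
        try:
            int(v)                      # still fits an integer type
        except ValueError:
            try:
                float(v)                # numeric but not integral
                if kind != "VARCHAR":
                    kind = "DECIMAL"
            except ValueError:
                kind = "VARCHAR"        # non-numeric value seen
    return f"VARCHAR({max_len})" if kind == "VARCHAR" else kind

sample = ["12", "7.5", "n/a"]   # e.g. values from the first 1,000 rows
print(infer_type(sample))
# VARCHAR(3)
```

Because only a sample is inspected, a longer or differently typed value past the sampled rows can invalidate the inferred type, which is why analyzing all rows can change the result.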
- Specify whether the first row in the source file contains the column names to be used in the destination table. Ensure that the Include Columns box lists only the columns that you want to import. Use the "less than" button (<) to remove columns from the Include Columns box and the "greater than" button (>) to add columns.
- You can specify the order of columns in the target table by using the Up and Down buttons in the Include Columns box.
- The column that is currently selected in the Include Columns box is highlighted in the Data Preview box.
- Specify data types, precision, length, and scale for the columns. Leave the default values, which Aginity Workbench has determined to be valid for the source data, or change them. Select the check box in the Null column if the column can contain null values.
A negative value of Scale means that scale is not applicable to the column.
- Treat all columns as varchar - Selecting this check box activates the Override length to be box. You can override the automatically determined length for all the columns by typing a value in this box and clicking Set.
- Trim strings - Select this check box to remove trailing spaces from strings in the file. This will apply to all string columns in the file (VARCHAR, NVARCHAR, and so on). If you want to remove trailing spaces from individual string columns, clear the Trim strings check box and select the Trim check box next to each column.
During import from an Excel file, trailing spaces are trimmed automatically, and this behavior cannot be changed. To preserve trailing spaces, we recommend that you choose a different file format for import.
- Analyze more (Re-check data types) - Click this button to have Aginity Workbench analyze the data types in the file and automatically make any necessary changes. During analysis, the application adjusts column lengths to the maximum length allowed by each column's data type, or changes a column's data type. The Analyze more option appears for a file that contains more than 1,000 rows; the Re-check data types option appears for a file of fewer than 1,000 rows. Click the button and then click Yes to start the analysis. A progress dialog box appears; you can click Stop to stop the analysis at the currently processed row. When the analysis is complete, a dialog box shows the total number of rows where column lengths or data types were adjusted and lists the changes that were made. If the application could not correct an error in a data type (such as an incompatible data type for a column), an error message is displayed, and you will need to correct the error manually.
- Select a destination database and a schema in the database, and type a name for the Hadoop table that will hold the imported data. Under Existing Table Action, select the action to take if the specified table already exists in the destination database. Select Ignore empty lines (all NULLS) if you don't want to import empty rows from the source file.
- In the final wizard page, specify the following settings:
- Create Hadoop Table - To import the data into a Hadoop table, select this check box. To import the data into a DB2 table, clear the check box and specify whether the table uses row or column organization.
Because the option to import into a DB2 table is not applicable to Hive, the Create Hadoop Table check box is unavailable if you have connected to the Hadoop cluster using the Hive method.
- Store As - Select a setting that determines the format of the destination table. When you select Custom and then click Custom SerDe Properties, a dialog box opens where you can set SerDe formatting properties.
- External - Select this check box to import the data into an external Hadoop table. Clear the check box to import into a managed Hadoop table.
- Use Default Location - Specify the name of the physical folder in HDFS to store the table data. Leave Use Default Location selected to use the default folder in HDFS. To use a different folder, clear the Use Default Location check box, click the ellipsis button at the end of the Location box, connect to HDFS, and select the folder.
- Analyze After Import - Select this check box to have Workbench generate and execute an ANALYZE script for the Hadoop table after import to generate statistics on the table. The server will use the generated statistics to optimize queries that reference the table.
The Analyze After Import option is only available if you have connected to your Hadoop cluster using the Big SQL method.
You can also have Workbench generate and execute an ANALYZE script for selected columns in the table after import to produce column-level statistics. You can specify the columns to analyze individually or as a group. If you define a column group, the script will generate distinct statistics for the grouping. Follow the instructions in To generate statistics on selected columns after import below.
To generate statistics on selected columns after import
- Click Analyze Options.
- In the Generate Analyze Table Script dialog box, select the Compute statistics for columns check box.
- Do one or both of the following:
- To specify individual columns to analyze, in the Available Columns box, select each column and click the "greater than" button (>) to move the column to the Columns to Gather Statistics for box. To move all the columns in the table to the Columns to Gather Statistics for box, click the "double greater than" button (>>).
- To specify a group of columns to analyze, click Column Groups, and then click New in the Column Groups to Gather Statistics for dialog box. Use the "greater than" button (>) to move each column that you want to put in the group to the Included Columns box. The names of the columns in the group are displayed in the Column Groups to Gather Statistics for dialog box. Click OK to return to the Generate Analyze Table Script dialog box.
- In the Generate Analyze Table Script dialog box, click OK.
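The statement that such a script executes can be sketched as a string builder. The syntax shown assumes Big SQL's ANALYZE TABLE ... COMPUTE STATISTICS FOR COLUMNS form; the schema, table, and column names are made up, and you should verify the exact syntax against your server's documentation:

```python
def build_analyze_script(schema, table, columns=(), groups=()):
    """Sketch of an ANALYZE statement like the one Workbench generates
    (Big SQL syntax assumed; verify against your server)."""
    # Individual columns are listed by name; a column group is
    # written as a parenthesized list so the server gathers
    # distinct statistics for the combination.
    targets = list(columns) + ["(" + ", ".join(g) + ")" for g in groups]
    stmt = f"ANALYZE TABLE {schema}.{table} COMPUTE STATISTICS"
    if targets:
        stmt += " FOR COLUMNS " + ", ".join(targets)
    return stmt + ";"

print(build_analyze_script("sales", "orders",
                           columns=["order_id"],
                           groups=[("region", "channel")]))
# ANALYZE TABLE sales.orders COMPUTE STATISTICS FOR COLUMNS order_id, (region, channel);
```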
- Click Finish.
The Job Monitor displays the progress of the file import job. When the import is complete, the Status description column displays the corresponding status.
- If you have selected the Existing Table Action of Do Not Import and the specified destination table exists in the database, the Job Monitor displays an error message.
- If you have selected the Analyze option to generate statistics on the table and/or selected columns from the table, and generating or executing an ANALYZE script for the table fails for some reason, the Job Monitor displays a corresponding status.
If you chose to import the data into a new table, the table is displayed within the specified schema node in the Object Browser.
You may need to refresh the Object Browser for the table to be displayed.
Aginity Workbench for Hadoop
Topic last updated August 30, 2018