Sqoop Incremental Import

Sqoop Incremental Import

Consider a table with 3 records which you already imported to hdfs using sqoop

n

+------+------------+----------+------+------------+n| sid  | city       | state    | rank | rDate      |n+------+------------+----------+------+------------+n|  101 | Chicago    | Illinois |    1 | 2014-01-25 |n|  101 | Schaumburg | Illinois |    3 | 2014-01-25 |n|  101 | Columbus   | Ohio     |    7 | 2014-01-25 |n+------+------------+----------+------+------------+nnsqoop import --connect jdbc:mysql://localhost:3306/ydb --table yloc --username root -Pn

n

Now you have additional records in the table but no updates on existing records

n

+------+------------+----------+------+------------+n| sid  | city       | state    | rank | rDate    |n+------+------------+----------+------+------------+n|  101 | Chicago    | Illinois |    1 | 2014-01-25 |n|  101 | Schaumburg | Illinois |    3 | 2014-01-25 |n|  101 | Columbus   | Ohio     |    7 | 2014-01-25 |n|  103 | Charlotte  | NC       |    9 | 2013-04-22 |n|  103 | Greenville | SC       |    9 | 2013-05-12 |n|  103 | Atlanta    | GA       |   11 | 2013-08-21 |n+------+------------+----------+------+------------+n

n

Here you should use an --incremental append with --check-column which specifies the column to be examined when determining which rows to import.

n

sqoop import --connect jdbc:mysql://localhost:3306/ydb --table yloc --username root -P --check-column rank --incremental append --last-value 7n

n

The above code will insert all the new rows based on the last value.

n

Now we can think of second case where there are updates in rows

n

+------+------------+----------+------+------------+n| sid  | city       | state    | rank | rDate      |n+------+------------+----------+------+------------+n|  101 | Chicago    | Illinois |    1 | 2015-01-01 |n|  101 | Schaumburg | Illinois |    3 | 2014-01-25 |n|  101 | Columbus   | Ohio     |    7 | 2014-01-25 |n|  103 | Charlotte  | NC       |    9 | 2013-04-22 |n|  103 | Greenville | SC       |    9 | 2013-05-12 |n|  103 | Atlanta    | GA       |   11 | 2013-08-21 |n|  104 | Dallas     | Texas    |    4 | 2015-02-02 |n|  105 | Phoenix    | Arzona   |   17 | 2015-02-24 |n+------+------------+----------+------+------------+n

n

Here we use incremental lastmodified where we will fetch all the updated rows based on date.

n

sqoop import --connect jdbc:mysql://localhost:3306/ydb --table yloc --username root -P   --check-column rDate --incremental lastmodified --last-value 2014-01-25 --target-dir yloc/locn

In answer to your first question, it depends on how you run the import statement. If you use the --incremental append option, you would be specifying your --check-column and --last-value arguments. These will dictate exactly which records are pulled and they will simply be appended to your table. nFor example: you could specify a DATE type column for your --check-column argument and a very early date (like 1900-01-01 or Day1 in your case) for --last-value and this would just keep appending everything in the source table (creating duplicate rows) to your destination. In this case, the new part files created will hold both new and old records. You could also use an increasing ID column and keep entering the small ID and that would have the same effect. However, if --last-value is Day2, there will be additional part files with only new records. Im not sure if you were wondering if you would lose the old records (just in case you were) but thats not the case.

n

The last-modified argument for --incremental would only be useful if, in the future, you go back and update some of the attributes of an existing row. In this case, it replaces the old data in your table (and adds the new stuff) with the updated version of the row thats now in your source table. Hope this helps!

n

Oh, all of this is based on The Sqoop User Guide Section 7.2.7 https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_incremental_imports

n

and Chapter 3 of the Apache Sqoop Cookbook (that chapter is actually fantastic!)

Sqoop Incremental Import

Step1 : The entire table is imported. This will be available as part-m file in your specified HDFS location (say /user/abc/def/part-m-00000)nStep2 : Only the incremental records are imported. This will be available in another location (say /user/abc/def1/part-m-00000)

n

Now that both the data are available, you can use the sqoop merge option to consolidate both based on the key column.

n

Refer to the below doc. for more details

n

https://sqoop.apache.org/docs/1.4.3/SqoopUserGuide.html#_literal_sqoop_merge_literal

Leave a Reply

Your email address will not be published.