Perl

Perl programming language tutorials.

Designing SQL Queries for an ETL System in Perl

I have a need for an ETL (Extract, Transform, and Load) application for a project I am working on.  Since the E and T parts are pretty simple, I didn't want to go out and use one of the many open source ETL frameworks.  I'm always looking for a challenge, so I also thought it would be a good exercise for me to see if I can test my SQL knowledge to design the SQL queries to do Load part of the ETL framework myself.  Think of it as a poor man's Business Intelligence system that I'm designing.

The Load part of an ETL system can be difficult to design, you must consider:

  • the ability to scale well with number of records
  • to perform within processing time constraints
  • to not put the database in a bad state during the huge database operations
  • to maintain data integrity by keeping in sync with any corrections in the data source

I have been operating for a while without any sort of database driven solution by using massive CSV files created from the data source queries.  It was practical to prototype using CSV files, but a better designed, better performing long term solution is needed.  Also, I'm getting to the point where those CSV files are so big and take up a lot of memory when slurping them into my perl scripts, that I'm running upon RAM limits on the server.  Not good at all!

The Load process gets complicated because the individual records I pull in from the data source might be changed or deleted since last time, so I need to be sure to check my data extract against the current stored data, and UPDATE any existing records.  Essentially, corrections to the data can happen at any point in the past, and I need to catch them and make sure the data reflects it.  Since our data is driven by dates, this means I need to do big queries over large lengths of time.  I contemplated a DELETE of all data, and then query it all fresh and INSERT, but I might as well just use CSV files then.  We can do better!

So knowing this, the operations that need to be done on the data extract retrieved from the data source are the following:

  • Check stored data against extract to delete current stored records that no longer exist
  • Check stored data against extract to change current stored records that exist
  • Check for records in extract that do not exist in the stored data and add them

It is important to have all of these operations in place to guarantee the stored data is correct.

To accomplish the Load part, I thought it best to use a database table architecture where I have a staging table, and a main database table.  I decided on this because I didn't want to be using perl to query the stored data and comparing to the extracted data from the data source.  I wanted to keep the perl script relatively simple and try to do as much stuff in smarter SQL queries to avoid having lots of individual INSERT/UPDATE/DELETE statements hitting the server from my perl script.  So getting the data up to the staging table is the first step, then running SQL queries on the staging table to modify the main table is the next step.

The data from the extract will be manipulated (transformed) in the perl script, then will be loaded into the staging table, which is an exact copy of the structure of the main table.

To allow the extracts from the data source take less time, I also wanted to be able to pull only certain time ranges and do the Load incrementally.  This works since my stored data is based on time.

But since everything can't be simple, I also need to mention that my data is designed where the unique identifiers are from two different columns, one of which is a measurement of time.

So the data is actually laid out like this:

  • keya - Integer
  • keyb - Integer
  • valuea - Integer
  • valueb - Integer

To start spewing out some SQL code, once I load the data into the staging table (in table main_stage), to be able to UPDATE the stored data (in table main) with the staging data for records that exist in both, this is the query.

 

UPDATE main AS m INNER JOIN main_stage AS s USING (keya,keyb) SET m.valuea=s.valuea, m.valueb=s.valueb

That will only update records in table main if the keys match in the table main_stage.

Now, I want to be able to INSERT records into the table main from main_stage where the keys appearing in main_stage do not appear in main.  Here is the SQL for that:

 

INSERT INTO main SELECT * FROM main_stage AS s WHERE NOT EXISTS (SELECT * FROM main AS m WHERE m.keya = s.keya AND m.keyb = s.keyb)

So now you have handled updating existing records, and inserting new ones, so you can empty the staging data table.

 

TRUNCATE TABLE main_stage

So after all this, you may have noticed that I didn't handle the part about deleting records in my stored data (table main), that didn't exist in the staged data.  Because I often will be loading the data into the staged table by date range, I wanted to cover that separately, even though it needs to occur before you TRUNCATE the staged table.

Here is the SQL, assuming that <criteria> is the criteria that you used to select a date range to load the staging table.

 

DELETE FROM main WHERE <criteria> AND NOT EXISTS (SELECT * FROM main_stage AS s WHERE main.keya = s.keya AND main.keyb = s.keyb)

So for all of the SQL in one block, here is the sequence:

 

DELETE FROM main WHERE <criteria> AND NOT EXISTS (SELECT * FROM main_stage AS s WHERE main.keya = s.keya AND main.keyb = s.keyb)

UPDATE main AS m INNER JOIN main_stage AS s USING (keya,keyb) SET m.valuea=s.valuea, m.valueb=s.valueb

INSERT INTO main SELECT * FROM main_stage AS s WHERE NOT EXISTS (SELECT * FROM main AS m WHERE m.keya = s.keya AND m.keyb = s.keyb)

TRUNCATE TABLE main_stage

To meet some of the criteria where you don't want to have anything cause the database tables to be in an intermediate state of not being correct, it is best to do the statements above as Transactions, so that only if they all succeed, do the operations actually take place.  Here is an example of perl code, running these queries, that has transactions enabled.

my $dsn = "dbi:mysql:dbtest:localhost:3306";
my $oDB = DBI->connect($dsn,"myusername","mypassword") or die "Unable to connect, error: $DBI::errstr\n";

# now process the staging table, using transactions
my $success = 1;
my $result = 1;

# turn autocommit off
$oDB->{'AutoCommit'} = 0;
if ($oDB->{'AutoCommit'}) {
  die "AutoCommit disable failed\n";
}

# if any records in main do not exist in staging table, delete the records in main
my $main_delete = "DELETE FROM main WHERE <criteria> AND NOT EXISTS (SELECT * FROM main_stage AS s WHERE main.keya = s.keya AND main.keyb = s.keyb)";
my $main_delete_handle = $oDB->prepare($main_delete);
$success &&= $main_delete_handle->execute();

# if any records in main match what is in staging table, update the records in main
my $main_update = "UPDATE main AS m INNER JOIN main_stage AS s USING (keya,keyb) SET m.valuea=s.valuea, m.valueb=s.valueb";
my $main_update_handle = $oDB->prepare($main_update);
$success &&= $main_update_handle->execute();

# what remains in staging table are new records, so insert any records from staging table into main
my $main_insert = "INSERT INTO main SELECT * FROM main_stage AS s WHERE NOT EXISTS (SELECT * FROM main AS m WHERE m.keya = s.keya AND m.keyb = s.keyb)";
my $main_insert_handle = $oDB->prepare($main_insert);
$success &&= $main_insert_handle->execute();

# delete the remainder of the staging table data
my $main_delete_insert = "TRUNCATE TABLE main_stage";
my $main_delete_insert_handle = $oDB->prepare($main_delete_insert);
$success &&= $main_delete_insert_handle->execute();

# determine whether to commit or not
$result = ($success ? $oDB->commit : $oDB->rollback);
unless($result) {
  die "Couldn't finish transaction: " . $oDB->errstr . "\n";
}

# turn autocommit back on
$oDB->{'AutoCommit'} = 1;
if (!$oDB->{'AutoCommit'}) {
  die "AutoCommit enable failed\n";
}

$oDB->disconnect;

Using Perl to Access the Lawson DME in LSF9 Environment

I wrote a previous article about using perl to query the Lawson DME for 8.0.3 environment.  We just got the LSF9 environment (not apps), and to my surprise, the authentication is different between the two.

The authentication for Lawson DME in 8.0.3 environment uses http basic authentication.

The authentication for LSF9 environment uses Lawson's own server side authentication.

Obviously this means that the method of accessing the DME from perl needs to change.  The perl code design has to change from using built in LWP credentials method of navigating through HTTP basic auth to simulating a login on the Lawson login page, capturing the cookie, and bringing it along with to make the request to the DME.

Here is the new code.  The function takes 3 arguments, a URL, userame, and password.  We only have a secure https address to access our DME, so that is why the Crypt::SSLeay library is needed.

 

###############
# gets the raw Lawson DME response from the server from a https secure site
# arguments:
#   URL - full URL, is designed to use https URLs
#   Username - Lawson portal username
#   Password - Lawson portal password
###############
sub getLawsonDMEResponse($$$) {
  use LWP::UserAgent;
  use Crypt::SSLeay;
  use HTML::Form;
  my $url = shift;
  my $username = shift;
  my $password = shift;

  my $ua = LWP::UserAgent->new;
  $ua->cookie_jar( {} ); # cookies needed for login authentication
  my $response = $ua->get($url);
  
  if($response->is_success) {
    # parse loginForm form and enter the values for the _ssoUser and _ssoPass field names
    my @forms = HTML::Form->parse($response);
    $forms[0]->value("_ssoUser",$username);
    $forms[0]->value("_ssoPass",$password);
    my $loginButton = $forms[0]->find_input("_ssoLogin");
    $loginButton->disabled(0);
    my $response2 = $ua->request($forms[0]->click("_ssoLogin"));
  
    if($response2->is_success) {
      # its OK, response 2 never succeeds, oh well, all we want is a cookie
      # it will fail with a 302 moved
    }
    else {
      # go ahead and call the URL we want
      my $response3 = $ua->get($url);
      if($response3->is_success) {
        return $response3->content;
      }
      else {
        print "URL request failed for url:$url\n";
      }
    }
  }
  else {
    die "Problems getting through login screen\n";
  }
}

Lawson Reporting using DME from perl

I have done a lot of reporting in Lawson using the Addins Query Wizard, but because I always have the need to make Lawson Reporting faster and automated, the Query Wizard shows its weaknesses.  You simply can't automate Lawson Reporting through the Addins.  But one thing that is a feature of the Addins that opens up a lot of Lawson automated reporting solutions is that you can get the URL that is generated by the Addins to use from any method that can access a web site.  I will show you how to automate Lawson Reporting using DME from perl.

Sometimes, Lawson reporting calls for more powerful measures than excel can handle through the Office Add-ins, and well, I know that a scripting language like perl is more than capable through my years of prior experience doing data analysis.  Sometimes, 65536 data records is a limitation of Excel that eliminates it as a solution option when trying to solve a Lawson reporting problem.  Bring on perl!

So my goal was to figure out how to use perl to query Lawson data through the DME that is the same method used by the Office Addins.  All the Office Add-ins really do is generate a web URL that passes arguments to the Lawson Reporting DME interface. This is important in our environment where our Lawson system is hosted, so we don't have the database connections or server-side programming that in-house Lawson setups would have.  So this means that I need to use HTTP access from perl to navigate through the HTTP basic authentication to query using the DME.  This is the only option that we have that would use Lawson security, as our ODBC connection is an all-access connection. Any Lawson Reporting solution should obey security access rules if you ever plan to distribute it.

So let me show you how to use perl to whip the Lawson data into shape.

Step 1)

Find a DME query.  Easy.  Just use the Lawson Office Add-ins to develop the query, run the query once, click on the little magnifying glass at the top of the Office Add-ins tool to get the full DME call.  Done!

Step 2)

Very simple perl code.  For this example, pulling Company, Employee, first and last name from the EMPLOYEE table in our PROD environment via the DME query.

#!/usr/bin/perl
use strict;
use LWP::UserAgent;

Next goes your DME query that you pulled from the Office add-ins.

# copy and paste the URL directly from the Office Add-ins Query Wizard
# IMPORTANT: if your server has a port, you NEED to add it in, otherwise the auth won't work
my $simpleURL = "http://my.lawson.server.com:30077/cgi-lawson/dme.exe?PROD=PROD&
  FILE=EMPLOYEE&INDEX=EMPSET1&KEY=1&
  FIELD=COMPANY;EMPLOYEE;FIRST%2DNAME;LAST%2DNAME&MAX=600&
  OUT=CSV&DELIM=%09&NOHEADER=TRUE";

Next, initialize the perl LWP::UserAgent module to do the HTTP request.  If you don't know your authentication realm, copy and paste the DME query into a blank web browser window, hit enter, and see what the prompt says.  Or just run this script, and an error will pop up that shows you, more details on that down below.

my $ua = LWP::UserAgent->new;
# set the basic HTTP auth credentials
# arguments are server:port, authentication realm, username => password
$ua->credentials('my.lawson.server.com:30077','Lawson Restricted Access','username'=>'password');
my $response = $ua->get($simpleURL);

Now we will verify the response, and print out an error that will contain the reason for failure.

 

if($response->is_success) {
  print $response->content;
}
else {
  # this will print out any errors
  # IMPORTANT: This also will tell you what your auth realm is if you don't know it
  die "Error: ", $response->header('WWW-Authenticate') ||
  'Error accessing', "\n", $response->status_line, "\n at $url\n Aborting";
}

Here is the full code listing for you copy and pasters.

#!/usr/bin/perl
use strict;
use LWP::UserAgent;

# copy and paste the URL directly from the Office Add-ins Query Wizard
# IMPORTANT: if your server has a port, you NEED to add it in, otherwise the auth won't work
my $simpleURL = "http://my.lawson.server.com:30077/cgi-lawson/dme.exe?PROD=PROD&
  FILE=EMPLOYEE&INDEX=EMPSET1&KEY=1&
  FIELD=COMPANY;EMPLOYEE;FIRST%2DNAME;LAST%2DNAME&MAX=600&
  OUT=CSV&DELIM=%09&NOHEADER=TRUE";

my $ua = LWP::UserAgent->new;
# set the basic HTTP auth credentials
# arguments are server:port, authentication realm, username => password
$ua->credentials('my.lawson.server.com:30077','Lawson Restricted Access','username'=>'password');
my $response = $ua->get($simpleURL);

if($response->is_success) {
  print $response->content;
}
else {
  # this will print out any errors
  # IMPORTANT: This also will tell you what your auth realm is if you don't know it
  die "Error: ", $response->header('WWW-Authenticate') ||
    'Error accessing', "\n", $response->status_line, "\n at $url\n Aborting";
}

Perl Constructor Tutorial

During all of my years of perl development, I really got used to using the constructor style used by many modules on CPAN, that is, I loved being able to instantiate a module like this:

my $newObject = MyClass->new( arg1 => "firstarg", arg2 => "secondarg");

I like this style because it compresses code into one line instead of calling methods to specify the args like this:

my $newObject = MyClass->new();
$newObject->arg1("firstarg");
$newObject->arg2("secondarg");

So I decided that I would write a perl constructor template that I could reuse during perl module development.

The main features I would look for in a constructor in any language are the following:

  1. Supports the inline arguments as explained above.
  2. Supports the copy constructor.
  3. Supports default arguments that you can define inside of your module.
  4. The inline arguments are allowed to override the default arguments.

So with these goals in mind, here is the constructor template that I use in my perl modules:

# default arguments
my %fields = ( arg1 => 0, arg2 => undef,);

sub new {
  my $that = shift;
  my $class = ref($that) || $that;
  my $self = {};
  if(ref($that)) { # copy
    %$self = %$that;
  }
  else { # new
    # the default arguments are defined in the $self object including %fields
    # the @_ brings in the inline arguments passed to the function
    # listing @_ after %fields will override %fields defaults
    $self = { %fields, @_, };
  }
  bless $self, $class;
  return $self;
}

Please see the comments inline above for the best description of the constructor template, and feel free to use it in your perl module projects.

Syndicate content