SQL Quickly and Dirtily Extracts a Field from a CSV File
October 22, 2008 Ted Holt
A colleague and I were recently tracking the way data progresses through a poorly documented information system. We were trying to determine at which point in the process a certain field’s decimal positions were being discarded. We began at the beginning, which happened to be a CSV file from a PC-based system. How could we quickly determine whether or not the CSV file had values in the decimal positions of the seventh field? We used SQL. I’ll provide a simple illustration. Here are the commands, in case you want to try this yourself. Let’s first create a CSV file we can play with. Run this command from the CL command line: crtpf qtemp/custcsv rcdlen(80) Next, load some data into the CSV file. INSERT INTO QTEMP/CUSTCSV SELECT CUSNUM||',"'||trim(LSTNAM)||'","'||INIT||'","'|| trim(STREET)||'","'||trim(CITY)||'","'||STATE||'",'||ZIPCOD|| ','||trim(char(CDTLMT))||','||CHGCOD||','|| trim(char(BALDUE))||','||trim(char(CDTDUE)) FROM qiws/qcustcdt The example CSV file, QTEMP/CUSTCSV, looks like this: 938472,"Henning","G K","4859 Elm Ave","Dallas","TX",75217,5000,3,37.00,.00 839283,"Jones","B D","21B NW 135 St","Clay","NY",13041,400,1,100.00,.00 392859,"Vine","S S","PO Box 79","Broton","VT",5046,700,1,439.00,.00 938485,"Johnson","J A","3 Alpine Way","Helen","GA",30545,9999,2,3987.50,33.50 397267,"Tyron","W E","13 Myrtle Dr","Hector","NY",14841,1000,1,.00,.00 389572,"Stevens","K L","208 Snow Pass","Denver","CO",80226,400,1,58.75,1.50 846283,"Alison","J S","787 Lake Dr","Isle","MN",56342,5000,3,10.00,.00 475938,"Doe","J W","59 Archer Rd","Sutter","CA",95685,700,2,250.00,100.00 693829,"Thomas","A N","3 Dove Circle","Casper","WY",82609,9999,2,.00,.00 593029,"Williams","E D","485 SE 2 Ave","Dallas","TX",75218,200,1,25.00,.00 192837,"Lee","F L","5963 Oak St","Hector","NY",14841,700,2,489.50,.50 583990,"Abraham","M T","392 Mill St","Isle","MN",56342,9999,3,500.00,.00 Now, find the records that have a lowercase “e” in the second position of the city field, which is the fifth field. with t1 as (select rrn(custcsv) as rrrn, substr(custcsv,locate(',',custcsv)+1) as x from qtemp/custcsv), t2 as (select rrrn, substr(x,locate(',',x)+1) as x from t1), t3 as (select rrrn, substr(x,locate(',',x)+1) as x from t2), t4 as (select rrrn, substr(x,locate(',',x)+1) as x from t3), t5 as (Select rrrn, substr(x,1,locate(',',x)-1) as city from t4) select * from t5 where city like '"_e%' Here’s the result set. RRRN CITY 4 "Helen" 5 "Hector" 6 "Denver" 11 "Hector" So, how does it work? Common table expression T1 extracts everything that follows the first comma, which is the second field and following. It also gets the relative record number of the original file. with t1 as (select rrn(custcsv) as rrrn, substr(custcsv,locate(',',custcsv)+1) as x from qtemp/custcsv), Common table expression T2 carries along the relative record number, and extracts everything that follows the first comma of field T1, which is the third field and the following: t2 as (select rrrn, substr(x,locate(',',x)+1) as x from t1), Common table expressions T3 and T4 work like T2, carrying along the relative record number of the original file and peeling one field from the front of the CSV data. Common table expression T5 carries along the relative record number and extracts the first field of T4, which is the fifth field of CUSTCSV. t5 as (Select rrrn, substr(x,1,locate(',',x)-1) as city from t4) All that remains to be done is select the desired records. select * from t5 where city like '"_e%' I’d hesitate to use this technique in production, but for quick and dirty data analysis, it worked great for us. My colleague and I determined that the decimal positions were in the CSV file, and were able to continue with our analysis.
|