There are some computational tasks of processing structured data that involves multi-table data sources joined through cascaded foreign key relationships. For example, you want to find particular orders from the order management system, where the following structure of cascaded foreign key relationships is involved. Computational target: Find orders in which the suppliers and customers are from the same region; group these orders by customers and compute total order amount and number of orders in each group.
Data structure is shown in the following figure:
To perform the computation with the database, you use the following SQL script:
select cid,count(ordered),sum(price*quantity) from orders
left join customer on orders.cid=customer.cid
left join region a1 on customer.city=a1.city
left join product on orders.pid=product.pid
left join supplier on product.pid=supplier.sid
left join region a2 on supplier.city=a2.city
where a1.region=a2.region and to_char(orderdate,’YYYY’)=?
Group by cid
Now let’s look at how it is solved using esProc file computing approach.
Define the cellet parameter year and perform related computing with cascaded foreign keys relationships using the following esProc script orders.dfx:
|
A |
1 |
=file(“D:/files/orders.b”).cursor@b() |
2 |
=A1.select(string(orderdate, “yyyy”)==string(year)) |
3 |
=file(“D:/files/customer.b”).import@b() |
4 |
=file(“D:/files/product.b”).import@b() |
5 |
=file(“D:/files/supplier.b”).import@b() |
6 |
=file(“D:/files/region.b”).import@b() |
7 |
=A3.switch(city,A6:city) |
8 |
=A5.switch(city,A6:city) |
9 |
=A4.switch(sid,A5:sid) |
10 |
=A2.switch(pid,A9:pid) |
11 |
=A10.switch(cid,A7:cid) |
12 |
=A11.select(pid.sid.city.region==cid.city.region) |
13 |
=A12.groups(cid.cid:cid;count(oid):count,sum(price*quantity):amount) |
14 |
result A13 |
A1: Create a file cursor for local Orders table to process data in segments and avoid memory overflow.
A2: Find orders for a certain year. year is a pre-specified cellset parameter.
A3-A6: Import dimension tables from local machine – Customer, Product, Supplier and Region.
A7: Switch values of the foreign key “city” in the table sequence Customer into corresponding records references in the table sequence Region.
A8: Switch values of the foreign key “city” in the table sequence Supplier into corresponding records references in the table sequence Region.
A9: Switch values of the foreign key “sid” in the table sequence Product into corresponding records references in the table sequence Supplier.
A10: Switch values of the foreign key “pid” in the table sequence Orders into corresponding records references in the table sequence Product.
A11: Switch values of the foreign key “cid” in the table sequence Orders into corresponding records references in the table sequence Customer.
A12: Filter records of the table sequence Orders by the criterion. With switches from A7 to A11, you can write the filtering criterion as pid.sid.city.region==cid.city.region.
A13: Perform group and aggregate based on the filtering result.
A14: Return A13 to an external program.
If there is the great volume of data in the fact table, Orders, you can use esProc multithreaded-parallel-processing approach to increase processing performance. As for the dimension tables, they are small enough to be entirely loaded into memory. All dimension tables can be accessed by every thread.
esProc script for doing this:
|
A |
B |
C |
1 |
=file(“D:/files/customer.b”).import@b() | ||
2 |
=file(“D:/files/product.b”).import@b() | ||
3 |
=file(“D:/files/supplier.b”).import@b() | ||
4 |
=file(“D:/files/region.b”).import@b() | ||
5 |
=A1.switch(city,A4:city) | =A3.switch(city,A4:city) | =A2.switch(sid,A3:sid) |
6 |
4 | ||
7 |
fork to(A6) | =file(“D:/files/orders.b”).cursor@bz(;,A7:A6) | |
8 |
=B7.switch(pid,C5:pid) | ||
9 |
=B8.switch(cid,A5:cid) | ||
10 |
=B9.select(string(orderdate, “yyyy”)==string(year) && pid.sid.city.region==cid.city.region) | ||
11 |
=B10.groups(cid.cid:cid;count(oid):count,sum(price*quantity):amount) | ||
12 |
result B11 | ||
13 |
=A7.conj() | ||
14 |
=A13.groups(cid;sum(count):count,sum(amount):amount) | ||
15 |
result A14 |
A1-A4: Import dimension tables from local machine – Customer, Product, Supplier and Region.
A5: Switch values of the foreign key “city” in the table sequence Customer into corresponding records references in the table sequence Region.
B5: Switch values of the foreign key “city” in the table sequence Supplier into corresponding records references in the table sequence Region.
C5: Switch values of the foreign key “sid” in the table sequence Product into corresponding records references in the table sequence Supplier.
A6: Set the number of parallel tasks.
A7: Execute parallel processing according to A6.
B7: Create a file cursor for local Orders table, with each thread processing a part of the data.
B8: Switch values of the foreign key “pid” in the table sequence Orders into corresponding records references in the table sequence Product.
B9: Switch values of the foreign key “cid” in the table sequence Orders into corresponding records references in the table sequence Customer.
B10: Filter records of the table sequence Orders by the specified criterion and year. With all those switches, you can write the filtering criterion as pid.sid.city.region==cid.city.region.
B11: Data grouping and aggregation.
B12: Each thread returns its result got in B11.
A13: Merge A7’s results.
A14: Group and aggregate the merged data.
A15: Return A14 to an external program.
One point to note is that you can import data from either a database or a file using esProc script. So you can decide where the fact table and dimension tables should be stored as needed. For example, store tables whose data is manipulated little in the file system and put others with frequent data manipulation in the database. The more the data is stored in the file system, the higher the performance and the less the pressure on the database.