DB2 REVIEW HWS
CISC 7512X HW# 2 (due by 3rd class): For the below `bank' schema:
customer(customerid,username,fname,lname,street1,street2,city,state,zip)
account(accountid,customerid,description)
transaction(transactionid,trantimestamp,accountid,amount)
A customer may have several accounts, and each account may participate in many transactions. Each transaction will have at least two records, one deducting amount from an account, and one adding amount to an account (for a single transactionid, the sum of amounts will equal zero).
Using SQL, answer these questions (write a SQL query that answers each question; a sketch for two of them appears after the list):
- What is the balance of accountid=42?
- What was the transaction amount of transactionid=42?
- Which transactionids do not sum up to zero (are invalid)?
- Which customers have no accounts?
- What is the balance (total across all accounts) for customerid=42?
- What is the total balance of all customers living in zip code 10001?
- Which zip code has the highest balance?
- List the top 1% of customers (ordered by total balance).
- Using balances for the previous two months, predict what the balances will be next month. (tip: find the slope of a line; x-axis is days, y-axis is balance. Two previous months means you have 2 points, so finding the slope is easy. Use the slope to predict where next month's balance will be.)
- List top 10 fastest growing accounts (using previous 2 months). (tip: same as above, fastest growing means steepest slope).
- Write a query to add 0.01% to each savings account (note that the money has to be accounted for).
- For each account, what was the closing balance on December 31, 2016?
- What percentage of the bank's money is held by people in the tri-state area (NY, NJ, CT) today?
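To make the flavor of these concrete, here is a minimal sketch for two of them, assuming standard SQL and the schema above (note that in some databases the transaction table name must be quoted, since it is a reserved word):

-- What is the balance of accountid=42? The balance is just the sum of all
-- transaction amounts that ever touched the account.
SELECT COALESCE(SUM(amount), 0) AS balance
FROM transaction
WHERE accountid = 42;

-- Which transactionids do not sum up to zero (are invalid)?
SELECT transactionid
FROM transaction
GROUP BY transactionid
HAVING SUM(amount) <> 0;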
CISC 7512X HW# 3: Imagine you have a database table with columns: phoneid, time, gps_latitude, gps_longitude. Assume these records are logged approximately every few seconds for every phone. Your task is to detect speeding: Write a database query (in SQL) to find anyone whose *average* speed is between 90 and 200mph for at least a minute. If you can't write a SQL query, write detailed procedural pseudocode (assume input is coming from a comma-delimited text file). Submit code via email, with subject "CISC 7512X HW3".
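One possible shape of the answer, as a hedged sketch: the table name phone_log, the flat-earth distance approximation, and PostgreSQL 11+ window syntax are all assumptions here, not part of the assignment.

-- Distance and elapsed time between consecutive pings per phone, then the
-- average speed over a trailing one-minute window.
WITH steps AS (
  SELECT phoneid, time,
         -- crude flat-earth distance in miles (~69 miles per degree);
         -- fine for the short hops between pings a few seconds apart
         sqrt(power(69.0 * (gps_latitude - lag(gps_latitude) OVER w), 2)
            + power(69.0 * cos(radians(gps_latitude))
                         * (gps_longitude - lag(gps_longitude) OVER w), 2)) AS miles,
         extract(epoch FROM (time - lag(time) OVER w)) / 3600.0 AS hours
  FROM phone_log
  WINDOW w AS (PARTITION BY phoneid ORDER BY time)
)
SELECT DISTINCT phoneid
FROM (
  SELECT phoneid,
         sum(miles) OVER m / nullif(sum(hours) OVER m, 0) AS avg_mph,
         count(*) OVER m AS pings
  FROM steps
  WINDOW m AS (PARTITION BY phoneid ORDER BY time
               RANGE BETWEEN interval '1 minute' PRECEDING AND CURRENT ROW)
) t
-- pings arrive every few seconds, so requiring a dozen of them is a rough
-- proxy for "the window actually spans close to a full minute"
WHERE avg_mph BETWEEN 90 AND 200
  AND pings >= 12;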
CISC 7512X HW# 4: In the not-so-distant future, flying cars are commonplace---everyone on the planet got one. Yes, there are ~10 billion flying cars all over the globe. Each one logs its coordinates every 10 milliseconds, even when parked. Assume x,y,z coordinates, with z being altitude, and x,y some cartesian equivalent of GPS. To avoid accidents, regulation states that no car may stay within 10 feet of any other car while in the air (z > 0) for longer than 1 second. Cars can go really fast, ~500mph. YOUR TASK: write an algorithm and program to find all violators. Assume input is a HUGE file (10 billion cars logging "VIN,timestamp,x,y,z" every 10 milliseconds all-the-time).
Install Apache Hadoop. [hadoop]. Write a Hive query (or a series of queries), or a MapReduce program to find all violators (cars that stay within 10 feet of other cars while in flight). Assume data is in a "cars" table in Hive (or the "/app/cars/data" file on HDFS). What is the running time of your algorithm? If it's O(N^2), can you make it run in O(N log N) time? (note that with this much data, N^2 is not practical; even N log N is a bit long). Using your 1-node Hadoop cluster, estimate the amount of resources this whole task will consume (to apply it to 10 billion cars), and put a dollar value on it (assuming it costs $0.10/hour to rent 1 node (machine), how much will your solution cost per day/month/year?); rationalize your answer. (note that you can't answer "I'll rent 1 node, and let it run until it's done."; you must process data at least as fast as it is being generated by all those billions of cars).
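One way to dodge the all-pairs comparison, sketched in HiveQL (the column names vin, ts, x, y, z are assumptions, with ts in milliseconds and coordinates in feet): bucket space into 10-foot cells so that two nearby cars always share a cell key, turning the O(N^2) pairing into an equi-join that Hive can shuffle in roughly O(N log N).

-- bucket airborne pings into 10-foot grid cells
WITH airborne AS (
  SELECT vin, ts, x, y, z,
         floor(x / 10) AS cx, floor(y / 10) AS cy, floor(z / 10) AS cz
  FROM cars
  WHERE z > 0
),
-- replicate each ping into its own cell plus the 26 neighbors, so any two
-- cars within 10 feet share at least one (ts, cell) key
shifted AS (
  SELECT vin, ts, x, y, z,
         cx + dx AS cx, cy + dy AS cy, cz + dz AS cz
  FROM airborne
  LATERAL VIEW explode(array(-1, 0, 1)) t1 AS dx
  LATERAL VIEW explode(array(-1, 0, 1)) t2 AS dy
  LATERAL VIEW explode(array(-1, 0, 1)) t3 AS dz
)
SELECT DISTINCT v1, v2
FROM (
  SELECT a.vin AS v1, b.vin AS v2
  FROM airborne a
  JOIN shifted b
    ON a.ts = b.ts AND a.cx = b.cx AND a.cy = b.cy AND a.cz = b.cz
  WHERE a.vin < b.vin
    AND power(a.x - b.x, 2) + power(a.y - b.y, 2) + power(a.z - b.z, 2) < 100
  -- pings are 10 ms apart: ~100 hits inside a one-second bucket means the
  -- pair stayed within 10 feet for roughly a full second
  GROUP BY a.vin, b.vin, floor(a.ts / 1000)
  HAVING count(*) >= 100
) pairs;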
Good Hadoop installation guide. Hive installation is much simpler: just unzip, set HIVE_HOME, add the bin folder to PATH, and then just run "hive". Here are some tips on getting Hive running for the first time (links may be outdated).
Submit whatever you create to solve this problem (source code for map reduce tasks, or hive queries, etc.,). Note, your solution must run (on small dataset) on a 1-node hadoop cluster.
CISC 7512X HW# 5 (due by Nth class): Your buddy stops over for lunch and tells you about this wonderful idea of building apps for phones (for profit!). The gist of the idea: ride sharing! (``Urgh, not again!'', you think). Unlike other ride-sharing ideas, this app is designed for the usual commuter who uses the car to get to work---and is willing to share the ride with someone else to lower their costs. Going out of the way to pick up folks is out of the question (the driver also needs to get to work themselves). Also, the driver prefers the fastest possible route (highways, etc.,) even if it means not picking up someone. Since everyone (including the driver) is benefiting from the ride, the goal is to lower the commute cost for everyone (including driver and passenger [the passenger would use their own car if it costs them less]). The business takes a small slice of the money saved (so it's a win-win for everyone involved). Also, folks will be able to pay for the ride in bitcoins. This all seems like crazy talk until your buddy mentions there's a potential $10m investment, and all they need from you is a working prototype and a write-up of the architecture by next week.
Your task: Design and build a database to run this business. What tables would you need? What events would you capture? Etc. Write up what interface and functionality would be needed to interact with the database. Make the investors see that this is a real viable idea that will actually work. Produce a business plan, design document, whitepaper, architecture, prototype, etc., whatever it takes to get that investment.
CISC 7512X HW# 6 (due by Nth class): Install HBase on your cluster from HW4. You have a relational database from HW2:
customer(customerid,username,fname,lname,street1,street2,city,state,zip)
account(accountid,customerid,description)
transaction(transactionid,trantimestamp,accountid,amount)
You would like to port it to HBase. How would you organize the data to make it easy to answer HW2 questions using HBase? What would you use as keys? Do you need to store multiple copies of the data?
Outline pseudo code (please don't write actual java) that would answer the following questions using your design:
- What is the balance of accountid=42?
- What was the transaction amount of transactionid=42?
- Which transactionids do not sum up to zero (are invalid)?
- Which customers have no accounts?
- What is the balance (total across all accounts) for customerid=42?
- What is the total balance of all customers living in zip code 10001?
- Which zip code has the highest balance?
CISC 7512X HW# 7: Download and install Spark. spark.apache.org. Port the code from HW4 to run on Spark/Scala [run a tiny example using Spark/Scala].
Submit a Scala/Spark script (whatever you type in spark-shell) to solve HW4.
CISC 7512X HW# 8: Write an implementation of the k-Means algorithm in SQL. Imagine you have a table such as cust_attributes(custid,attributename,attributetype,attributevalue). You'd like to use only "numeric" values to cluster all of your customers into, say, 7 clusters. In other words, you'd like to generate another table with columns cust_cluster(custid,clusterid), where clusterid groups this customer with other similar customers based on that customer's attributes. Note that you'll need some mechanism of running the same query over and over again---you can do that via an external script, or use recursive queries to iterate. [before, you've used SQL as a query engine---in this homework you're using it as a computation engine].
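For a feel of how this looks, here is a hedged sketch of a single iteration, assuming PostgreSQL-style SQL, a working table centroids(clusterid,attributename,attributevalue) seeded with 7 initial centers, and attribute values that cast cleanly to numbers (all illustrative assumptions):

-- Step 1: assign each customer to the nearest centroid (squared Euclidean
-- distance over the numeric attributes they share).
TRUNCATE cust_cluster;
INSERT INTO cust_cluster (custid, clusterid)
SELECT custid, clusterid
FROM (
  SELECT c.custid, k.clusterid,
         row_number() OVER (
           PARTITION BY c.custid
           ORDER BY sum(power(cast(c.attributevalue AS double precision)
                              - k.attributevalue, 2))) AS rnk
  FROM cust_attributes c
  JOIN centroids k USING (attributename)
  WHERE c.attributetype = 'numeric'
  GROUP BY c.custid, k.clusterid
) d
WHERE rnk = 1;

-- Step 2: move each centroid to the mean of its assigned customers.
TRUNCATE centroids;
INSERT INTO centroids (clusterid, attributename, attributevalue)
SELECT cc.clusterid, a.attributename,
       avg(cast(a.attributevalue AS double precision))
FROM cust_cluster cc
JOIN cust_attributes a ON a.custid = cc.custid
WHERE a.attributetype = 'numeric'
GROUP BY cc.clusterid, a.attributename;

An external driver (or a recursive CTE) just reruns these two steps until the assignments in cust_cluster stop changing.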
CISC 7512X HW# 9: (this homework is inspired by an interview question I've been asked): In this homework you'll be using this file:
[quotes_UsConsolidated_....txt.gz].
This file includes US stock quote data. Each row is a quote. A quote could be from a single venue or a consolidated quote across all venues. Each file covers 10 minutes of quotes for a subset of stocks.
File Specification: The first row from the sample file:
86|1|18:10:00.000|U|0||5|3|BP.N||3=13:09:59.993|1=16|0=39.93|2=0x52|8=13:09:59.993|6=21|5=39.94|11=2017-12-11|1715=13:09:59.993|7=0x52|1427=C|
Each row contains two parts:
The header is comprised of 10 pipe-delimited fields. The only relevant field for this problem is the 9th, the symbol.
Symbols have the form "AAA.BB" where AAA is the ticker and BB is the venue.
Quotes with symbols ending in "." (e.g. "AAA.") are consolidated quotes.
Quotes with venues specified (e.g. "AAA.BB") are venue quotes that contribute to the consolidated quotes for their ticker (e.g. "AAA.").
The body is comprised of a variable number of pipe-delimited key-value fields representing the latest known value for a ticker/venue combination.
If a key is missing, its value is retained from the prior entry for that ticker/venue. Values for a given ticker/venue are valid for a given trade date until explicitly updated.
If a key is specified but has no value (e.g. "|3=|") then the prior value does not carry over, but is instead missing. This may occur if, for example, a venue has no bids for a security at the moment.
The relevant keys are:
0: bid
1: bid size
3: bid time
5: ask
6: ask size
8: ask time
11: trade date
In general, both venue and consolidated quotes are valid until updated. The consolidated quote represents the highest valid bid (or lowest ask) across all venues. Certain condition codes on venue quotes can indicate that the venue is no longer valid for inclusion in the consolidated quote.
Task1: Write ETL code to save the following fields from the venue quotes in Parquet format:
ticker, date, time, venue, bid, bid size, ask, ask size.
The data written should be fully reflective of the state of the market as of each quote---i.e. if the current bid is unspecified in a row on the input because it is unchanged, it nonetheless should appear in the Parquet data. If the current bid is unavailable because it was explicitly nulled (i.e. a |0=| entry in the file) it should appear as a null in the Parquet data.
Task2: For each date, ticker and minute from 09:31 through 16:00, calculate the number of venues that are showing the same bid price as the consolidated quote at the end of the minute interval. Include only quotes for the trade date specified in the file name.
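A hedged Spark SQL sketch for Task2, assuming Task1 produced tables venue_quotes and consolidated_quotes with the fields above and time as an HH:MM:SS.mmm string (table and column names are illustrative). It takes the last quote within each minute; a fuller solution would also carry standing quotes forward across minutes with no updates, and filter on the trade date from the file name.

WITH last_venue AS (   -- last venue quote per ticker/venue/minute
  SELECT date, ticker, venue, minute, bid
  FROM (
    SELECT date, ticker, venue, bid,
           substr(time, 1, 5) AS minute,
           row_number() OVER (PARTITION BY date, ticker, venue, substr(time, 1, 5)
                              ORDER BY time DESC) AS rn
    FROM venue_quotes
  ) q
  WHERE rn = 1
),
last_cons AS (         -- last consolidated quote per ticker/minute
  SELECT date, ticker, minute, bid AS cons_bid
  FROM (
    SELECT date, ticker, bid,
           substr(time, 1, 5) AS minute,
           row_number() OVER (PARTITION BY date, ticker, substr(time, 1, 5)
                              ORDER BY time DESC) AS rn
    FROM consolidated_quotes
  ) q
  WHERE rn = 1
)
SELECT v.date, v.ticker, v.minute,
       count(*) AS venues_at_consolidated_bid
FROM last_venue v
JOIN last_cons c
  ON v.date = c.date AND v.ticker = c.ticker AND v.minute = c.minute
WHERE v.minute BETWEEN '09:31' AND '16:00'
  AND v.bid = c.cons_bid   -- venue shows the same bid as the consolidated quote
GROUP BY v.date, v.ticker, v.minute;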
Submit the Spark program to do Task1 and Task2.
CISC 7512X HW# 10: In this homework, you'll write a few utilities. These should be flexible enough to run on a schedule (cron, etc.). I highly recommend you use GPG for this (generate a public/private key pair; do not keep the private key anywhere near these programs); don't recreate stuff if you can just use other programs/libraries. (I don't expect each of these to be longer than, say, 10-20 lines of code.)
Write a program to perform a database backup. Your program accepts database connection info, a database table, and an output directory as parameters. Your program will start and check that there are no other instances of the program running for that table (if there are, your program exits). Your program then proceeds to dump all of the data from the table to a .csv file (comma delimited). No headers. It must be a .csv file since you can load that file into anything (even open it in Excel), in case there's an emergency serious enough for you to actually *need* the backup urgently. Once your program dumps the data from the table into the file, it compresses the file (gzip) and generates a .comp marker file for the compressed .csv file it just created. If you start with "mytable" in some database, you should end up with "mytable.YYYYMMDDHHMISS.csv.gz" and "mytable.YYYYMMDDHHMISS.csv.gz.comp" in the output directory. That represents an image of that table as of that timestamp.
Write (another) program that accepts an "input" folder, an "encrypted" folder, and a "public key" file. This program ensures there is only 1 copy of it running at any given time (each instance attempts to get a lock on some file; if it fails to get the lock, it exits). When a FILENAME.comp shows up in the "input" folder, your utility will encrypt FILENAME using the public key and place it into the "encrypted" folder. Your utility then verifies that the encrypted file was created successfully (length is not zero, and there were no errors such as running out of disk space, etc.); it then creates a FILENAME.comp file in the "encrypted" folder, and erases FILENAME.comp and FILENAME from the "input" folder. You then loop through the "input" folder and erase all files (those without the .comp) with create date older than 30 days.
Write another program, that accepts a "cleanup" folder. The program makes a list of all the files, sorts them by their modified timestamp, keeps the latest 20 entries, and from what's left, erases anything older than 20 days (you don't want to erase an "old" file if it's the only one in the folder).
Here's what you can do with this setup: have the first program run daily (backup your database). Have the 2nd program run on the folder generated by the 1st program (encrypt data) and set the output folder to point to Dropbox (or something similar). You now have an encrypted backup of your database in the cloud that's refreshed daily! Run the 3rd program on the Dropbox folder to clean up old backups (so you don't fill up the space). At any given time, you'll have 20 days of backups that nobody except you (via private key) can access.