Hadoop + MySQL == Killer Combo?
[First, I should declare my ignorance of databases, data warehousing and data processing. Hopefully I will get better with time.]
For the last couple of days I have been working a lot with Hadoop and MySQL. This is what I have been doing:
- Take the raw data and push it to HDFS
- Clean the data using simple streaming jobs in Python
- Copy the cleaned data to a local drive and ingest it into MySQL
- Run queries and visualize the results
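To make the cleaning step concrete, here is a minimal sketch of the kind of streaming mapper I mean. The three-field layout and the names `EXPECTED_FIELDS`, `clean_line`, `clean` and `main` are assumptions for illustration only, not my actual job; the only real requirement is that a streaming mapper reads lines from stdin and writes lines to stdout.

```python
import sys

# Hypothetical record layout: three tab-separated fields per line.
EXPECTED_FIELDS = 3


def clean_line(line, expected_fields=EXPECTED_FIELDS):
    """Return a cleaned tab-separated record, or None if the line is malformed."""
    fields = [f.strip() for f in line.rstrip("\n").split("\t")]
    # Drop records with the wrong number of fields or with empty fields.
    if len(fields) != expected_fields or any(f == "" for f in fields):
        return None
    return "\t".join(fields)


def clean(lines):
    """Yield cleaned records, silently skipping malformed input lines."""
    for line in lines:
        cleaned = clean_line(line)
        if cleaned is not None:
            yield cleaned


def main():
    # As a streaming mapper: read raw lines from stdin, write clean ones to stdout.
    for record in clean(sys.stdin):
        sys.stdout.write(record + "\n")
```

Calling `main()` from the usual `if __name__ == "__main__":` guard turns this into a script that can be used as a mapper. Keeping the output tab-separated is convenient because tab is also the default field delimiter for MySQL's `LOAD DATA INFILE`.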
And I must tell you, I have been seeing a lot of patterns in the jobs that I have been running. These typically include:
- Filtering: Taking a dataset and filtering out records based on some predicate (on either the key or the value of key-value pairs).
- Joins: Taking dataset A and dataset B and merging the records that have some key or predicate in common.
- Aggregating: Given key,value pairs, emitting key,function(value1,value2,...,valueN)
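The three patterns above can be sketched in plain Python. This is an in-memory illustration of the semantics only, not a Hadoop job, and the function names and sample data are made up for the example.

```python
from collections import defaultdict


def filter_pairs(pairs, predicate):
    """Keep only the (key, value) pairs for which predicate(key, value) is true."""
    return [(k, v) for k, v in pairs if predicate(k, v)]


def join_pairs(left, right):
    """Inner join on key: emit (key, (left_value, right_value)) for every match."""
    index = defaultdict(list)
    for k, v in right:
        index[k].append(v)
    return [(k, (lv, rv)) for k, lv in left for rv in index.get(k, [])]


def aggregate_pairs(pairs, func):
    """Group values by key and emit (key, func([value1, ..., valueN])), sorted by key."""
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return sorted((k, func(vs)) for k, vs in groups.items())


# Hypothetical sample data: page visits per user and a user-to-country lookup.
visits = [("alice", 3), ("bob", 5), ("alice", 4)]
countries = [("alice", "IN"), ("bob", "US")]

heavy = filter_pairs(visits, lambda k, v: v > 3)
joined = join_pairs(visits, countries)
totals = aggregate_pairs(visits, sum)
```

In a real Hadoop job, filtering would usually live in the mapper and aggregation in the reducer, while a join needs both datasets to be grouped by the same key before the reducer sees them.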
I know HBase and Pig might have better implementations of these functions, but it would be nice to have a library for Hadoop jobs that developers can reuse too.
But the key question is: should I use Hadoop just for data cleaning and do the rest in MySQL? If so, what kind of load can MySQL handle on the same number of boxes? If not, how quickly can Hadoop produce results for SQL-like queries? Or, even better, do Hadoop and MySQL complement each other?
Honestly, the answers to these questions depend on the following constraints:
- Sizes of datasets
- Frequency and complexity of the SQL queries to run vs. the effort of programming and running Hadoop jobs
- Engineering time at hand
I will write more about my experiences soon. So stay tuned!
posted 1 year ago