database - What are the approaches to the Big-Data problems?


Let us consider the following problem. We have a system containing a huge amount of data (big data), so in fact we have a database. As the first requirement, we want to be able to write to and read from the database quickly. We also want to have a web interface to the database (so that different clients can write to and read from the database remotely).

But the system we want to have should be more than a database. First, we want to be able to run different data-analysis algorithms on the data to find regularities, correlations, abnormalities and so on (as before, we care a lot about performance). Second, we want to bind machine learning machinery to the database. This means that we want to run machine learning algorithms on the data to be able to learn "relations" present in the data and, based on that, predict the values of entries that are not yet in the database.

Finally, we want to have a nice click-based interface that visualizes the data, so that users can see the data in the form of nice graphics, graphs and other interactive visualisation objects.

What are the standard and widely recognised approaches to the problem described above? What programming languages have to be used to deal with the described problems?

I will approach your question like this: I assume you are already firmly interested in using a big data database and have a real need for one, so instead of repeating textbooks upon textbooks of information about them, I will highlight some that meet your 5 requirements - mainly Cassandra and Hadoop.


1) The first requirement is that we want to be able to write to and read from the database quickly.

You'll want to explore NoSQL databases, which are often used for storing "unstructured" big data. Open-source options include Hadoop (strictly a distributed storage and processing framework rather than a database) and Cassandra. Regarding Cassandra,

"Facebook needed something fast and cheap to handle billions of status updates, so it started this project and eventually moved it to Apache, where it's found plenty of support in many communities" (ref).

2) We also want to have a web interface to the database.

See the list of 150 NoSQL databases to see the various interfaces available, including web interfaces.

Cassandra has a cluster admin, a web-based environment, a web admin based on AngularJS, and even GUI clients.

3) We want to be able to run different data-analysis algorithms on the data.

Cassandra, Hive, and Hadoop are well-suited for data analytics. For example, eBay uses Cassandra for managing time-series data.
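To make "regularities, correlations, abnormalities" a bit more concrete, here is a minimal pure-Python sketch of two such analyses: a Pearson correlation between two columns and a simple z-score check for outliers. The function names, the threshold and the sample readings are illustrative assumptions, not part of any of the tools above; on real big-data volumes the same computations would normally be pushed down to the cluster (for example as a Hive query or a Spark job) rather than run in a single process.

```python
from math import sqrt
from statistics import mean, pstdev


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long columns."""
    mean_x, mean_y = mean(xs), mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def outliers(values, threshold=2.0):
    """Values lying more than `threshold` standard deviations from the mean."""
    center = mean(values)
    spread = pstdev(values)
    if spread == 0:
        return []  # all values identical, nothing can be an outlier
    return [v for v in values if abs(v - center) / spread > threshold]


# Hypothetical sample data standing in for rows read from the database.
readings = [10, 11, 9, 10, 12, 10, 11, 50]
suspicious = outliers(readings)
correlation = pearson([1, 2, 3, 4], [2, 4, 6, 8])
```

With the sample data above, the value 50 is flagged as an abnormality and the two perfectly proportional columns have a correlation of about 1.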

4) We want to run machine learning algorithms on the data to be able to learn "relations".

Again, Cassandra and Hadoop are well-suited. Regarding Apache Spark + Cassandra,

"Spark was developed in 2009 at UC Berkeley AMPLab, open sourced in 2010, and became a top-level Apache project in February 2014. It has since become one of the largest open source communities in big data, with over 200 contributors in 50+ organizations" (ref).

Regarding Hadoop,

"With the rapid adoption of Apache Hadoop, enterprises use machine learning as a key technology to extract tangible business value from their massive data assets."
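As a toy illustration of what learning a "relation" and predicting entries that are not yet in the database can mean, the sketch below fits a straight line to known (x, y) rows by ordinary least squares and uses it to predict a missing value. It is plain Python with made-up data and hypothetical helper names; it is not the Spark or Hadoop API, where a library such as Spark's MLlib would typically do this work at scale.

```python
def fit_line(points):
    """Least-squares fit of y = slope * x + intercept to (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx  # assumes the x values are not all identical
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Predict y for an x that has no stored row yet."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical rows already present in the database.
known_rows = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
model = fit_line(known_rows)
predicted = predict(model, 4.0)
```

For these rows the learned relation is y = 2x + 1, so the predicted value for x = 4 is 9.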

5) Finally, we want to have a nice click-based interface that visualizes the data.

Visualization tools (paid) that work with the above databases include Pentaho, JasperReports, and Datameer Analytics Solutions. Alternatively, there are several open-source interactive visualization tools such as D3 and Dygraphs (for big data sets).
