database - What are the approaches to the Big-Data problems?
Let us consider the following problem. We have a system containing a huge amount of data (big data), so in fact we have a database. As the first requirement, we want to be able to write to and read from the database quickly. We also want to have a web interface to the database (so that different clients can write to and read from the database remotely).
But the system that we want to have should be more than a database. First, we want to be able to run different data-analysis algorithms on the data to find regularities, correlations, abnormalities and so on (as before, we care a lot about performance). Second, we want to bind machine learning machinery to the database. This means that we want to run machine learning algorithms on the data to be able to learn the "relations" present in the data and, based on that, predict the values of entries that are not yet in the database.
Finally, we want to have a nice click-based interface that visualizes the data, so that users can see the data in the form of nice graphics, graphs and other interactive visualisation objects.
What are the standard and recognised approaches to the problem described above? What programming languages should be used to deal with the described problems?
I would approach your question like this: I'll assume you are firmly interested in big data database use and have a real need for one, so instead of repeating textbooks upon textbooks of information about them, I'll highlight some that meet your 5 requirements - mainly Cassandra and Hadoop.
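1) The first requirement is that we want to be able to write to and read from the database quickly.
You'll want to explore NoSQL databases, which are often used for storing “unstructured” big data. Some open-source options include Hadoop (strictly a distributed storage and processing framework rather than a database) and Cassandra. Regarding Cassandra,
Facebook needed something fast and cheap to handle billions of status updates, so it started this project and eventually moved it to Apache, where it's found plenty of support in many communities (ref).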
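To give a feel for why a store like Cassandra can spread reads and writes across many machines, here is a minimal sketch of a consistent-hash ring in Python. It is a toy illustration only: the class name `HashRing`, the node names and the use of MD5 are assumptions made for this example, and Cassandra's real partitioners and replication are considerably more involved.

```python
import bisect
import hashlib


def token(key: str) -> int:
    """Map a key to a position on the ring (MD5 is used purely for illustration)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Toy consistent-hash ring: a key belongs to the next node token clockwise."""

    def __init__(self, nodes, vnodes=64):
        # Several virtual nodes per physical node smooth out the distribution.
        self._ring = sorted(
            (token(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._tokens = [t for t, _ in self._ring]

    def node_for(self, partition_key: str) -> str:
        # Find the first token at or after the key's token, wrapping around.
        idx = bisect.bisect_left(self._tokens, token(partition_key)) % len(self._ring)
        return self._ring[idx][1]


# Hypothetical three-node cluster and a few partition keys.
ring = HashRing(["node-a", "node-b", "node-c"])
placement = {user: ring.node_for(user) for user in ("alice", "bob", "carol", "dave")}
```

Because each key maps to a node without consulting a central index, any node can route a request directly, and adding a node only moves the keys that fall on the new node's part of the ring.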
References:
- Big data and NoSQL: 5 key insights
- NoSQL standouts: New databases for new applications
- Big data woes: Which database should I use?
- Cassandra and Spark: A match made in big data heaven
- List of NoSQL databases (currently 150)
2) We want to have a web interface to the database.
See the list of 150 NoSQL databases to see the various interfaces available, including web interfaces.
Cassandra has a cluster admin, a web-based environment, a web admin based on AngularJS, and GUI clients.
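If none of the ready-made interfaces fits, a thin HTTP layer in front of the database is not much code. The following is a minimal sketch using only Python's standard library; the `/records/<key>` URL scheme, the `handle` function and the in-memory `STORE` dictionary are all assumptions for illustration, with `STORE` standing in for calls to the real database.

```python
import json
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

STORE = {}  # hypothetical in-memory stand-in for the real database


def handle(method, path, body=b""):
    """Pure request logic: returns (status_code, payload)."""
    parts = path.strip("/").split("/")
    if len(parts) != 2 or parts[0] != "records":
        return 404, {"error": "not found"}
    key = parts[1]
    if method == "GET":
        if key not in STORE:
            return 404, {"error": "not found"}
        return 200, STORE[key]
    if method == "PUT":
        try:
            STORE[key] = json.loads(body)
        except ValueError:  # covers malformed JSON and bad encodings
            return 400, {"error": "body must be JSON"}
        return 200, {"stored": key}
    return 405, {"error": "method not allowed"}


class RecordHandler(BaseHTTPRequestHandler):
    """Adapts HTTP requests to handle() and writes the JSON response."""

    def _dispatch(self):
        length = int(self.headers.get("Content-Length") or 0)
        status, payload = handle(self.command, self.path, self.rfile.read(length))
        data = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

    do_GET = do_PUT = _dispatch


def make_server(port=0):
    """Bind to localhost; port 0 lets the OS pick a free port."""
    return ThreadingHTTPServer(("127.0.0.1", port), RecordHandler)
```

Calling `make_server(8080).serve_forever()` would start it locally. Keeping the request logic in a plain function makes it easy to test and to swap the dictionary for a real database client later.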
3) We want to be able to run different data-analysis algorithms on the data.
Cassandra, Hive, and Hadoop are well-suited for data analytics. For example, eBay uses Cassandra for managing time-series data.
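As a small illustration of the kind of analysis meant here (correlations and abnormalities in time-series data), the sketch below computes a Pearson correlation and flags outliers against a rolling window. The function names, window size, threshold and sample `readings` are assumptions for the example; on a real cluster the same logic would run inside an analytics framework rather than on a Python list.

```python
from collections import deque
from math import sqrt
from statistics import mean, pstdev


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def anomalies(series, window=5, threshold=3.0):
    """Indices whose value is more than `threshold` standard deviations
    away from the mean of the preceding `window` values."""
    recent = deque(maxlen=window)
    flagged = []
    for i, value in enumerate(series):
        if len(recent) == window:
            mu, sigma = mean(recent), pstdev(recent)
            if sigma > 0 and abs(value - mu) > threshold * sigma:
                flagged.append(i)
        recent.append(value)
    return flagged


# Made-up sample data with one obvious spike.
readings = [10, 11, 10, 12, 11, 10, 11, 50, 11, 10, 12, 11]
spikes = anomalies(readings)
```

For the sample `readings`, only index 7 (the value 50) is flagged. The rolling window keeps memory use constant, which matters when the series is streamed out of the database rather than loaded whole.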
References:
- Cassandra, Hive, and Hadoop: How we picked our analytics stack
- Cassandra at eBay - Cassandra Summit
- An introduction to real-time analytics with Cassandra and Hadoop
4) We want to run machine learning algorithms on the data to be able to learn "relations".
Again, Cassandra and Hadoop are well-suited. Regarding Apache Spark + Cassandra,
Spark was developed in 2009 at UC Berkeley AMPLab, open sourced in 2010, and became a top-level Apache project in February, 2014. It has since become one of the largest open source communities in big data, with over 200 contributors in 50+ organizations (ref).
Regarding Hadoop,
With the rapid adoption of Apache Hadoop, enterprises use machine learning as a key technology to extract tangible business value from their massive data assets.
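To make "learning relations and predicting entries not yet in the database" concrete, here is a minimal sketch of the simplest such model, a least-squares line fitted to existing rows and used to predict a missing value. The helper names and the sample rows are made up for illustration; at scale you would reach for a library such as Mahout or Spark's machine learning tooling instead of hand-written code.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict(model, x):
    """Apply a fitted (slope, intercept) pair to a new input."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical (x, y) rows already stored in the database.
rows = [(1, 3.1), (2, 4.9), (3, 7.2), (4, 8.8)]
model = fit_line([x for x, _ in rows], [y for _, y in rows])

# Estimate y for an x that has no stored entry yet.
estimate = predict(model, 5)
```

The pattern is the same for heavier models: train on the rows you have, keep the fitted parameters, and call the model for keys that are missing from the database.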
References:
- Getting started with Apache Spark and Cassandra
- What is Apache Mahout?
- Data science with Apache Hadoop: Predicting airline delays
5) Finally, we want to have a nice click-based interface to visualize the data.
Visualization tools (paid) that work with the above databases include Pentaho, JasperReports, and Datameer Analytics Solutions. Alternatively, there are several open-source interactive visualization tools such as D3 and Dygraphs (for big data sets).
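Browser-based charting tools draw on the client, so with big data sets it usually helps to shrink a series on the server before sending it. The sketch below shows one simple way to do that, averaging consecutive points into buckets and emitting CSV. The function names and the choice of CSV are assumptions for illustration, not a requirement of any particular charting library.

```python
import csv
import io


def downsample(points, max_points):
    """Average consecutive (x, y) points into at most `max_points` buckets."""
    if len(points) <= max_points:
        return list(points)
    size = -(-len(points) // max_points)  # ceiling division
    reduced = []
    for start in range(0, len(points), size):
        bucket = points[start:start + size]
        # Keep the bucket's first x and the mean of its y values.
        reduced.append((bucket[0][0], sum(y for _, y in bucket) / len(bucket)))
    return reduced


def to_csv(points, header=("x", "y")):
    """Serialize points as CSV text with a header row."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(points)
    return out.getvalue()


# Made-up series: 10 points reduced to 5 before being sent to the browser.
series = [(i, float(i)) for i in range(10)]
payload = to_csv(downsample(series, 5))
```

Plain averaging hides short spikes, so for anomaly-heavy data you may prefer to keep each bucket's minimum and maximum as well.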