HiFiCoding


Database systems with MapReduce

Details
Published: Thursday, 09 January 2014 06:09

Advancement in the database industry remains extremely active. In just a few years, new data-scaling models have dramatically changed the efficiency that database systems can provide. These include the use of very large, multi-terabyte memories, which considerably increase the amount of data cached in main memory, and the movement of database functions into the storage layer to speed operations such as reads, encryption, and compression. One prominent such model is MapReduce.

MapReduce

MapReduce is a programming model for processing and generating large data sets. It is inspired by the map and reduce functions commonly used in functional programming.

  1. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs.
  2. Users specify a reduce function that merges all intermediate values associated with the same intermediate key.
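The two steps above can be sketched with the classic word-count example. This is a toy, single-process illustration; the names map_fn, reduce_fn, and map_reduce are illustrative and not taken from any particular framework.

```python
from collections import defaultdict

def map_fn(document):
    """Emit an intermediate (word, 1) pair for every word in the document."""
    return [(word, 1) for word in document.split()]

def reduce_fn(word, counts):
    """Merge all intermediate values that share the same key."""
    return (word, sum(counts))

def map_reduce(documents):
    # Map phase: run map_fn over every input record.
    intermediate = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            intermediate[key].append(value)
    # Reduce phase: merge the values grouped by intermediate key.
    return dict(reduce_fn(k, v) for k, v in intermediate.items())

print(map_reduce(["the quick fox", "the lazy dog"]))
# {'the': 2, 'quick': 1, 'fox': 1, 'lazy': 1, 'dog': 1}
```

In a real deployment the map calls run on many machines over partitions of the input, and the grouping step is a distributed shuffle rather than an in-memory dictionary.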

Programs written in this functional style are automatically parallelized and executed on a large cluster of machines. The run-time system takes care of partitioning the input data, scheduling execution across the cluster, managing inter-machine communication, and handling machine failures. This allows developers without any prior experience with distributed and parallel systems to easily utilize the resources of a large distributed system.

Origins of MapReduce

The idea behind MapReduce originated at popular Internet search engine companies such as Google, Yahoo, and Ask. Search engine companies run massive farms of servers that crawl web sites and retrieve useful web pages into local files, then process this data to build search indexes.
Algorithms are used to determine the quality of a web page and its significance to search terms. In this setting, the map function identifies candidate search terms in a web page, and the reduce function counts how many times each search term is used in a page.


MapReduce Implementation

A popular implementation of MapReduce is Apache Hadoop. Hadoop is a MapReduce-based data-processing engine designed to execute queries and other batch read operations against colossal datasets that can be terabytes or even petabytes in size. Data is first added to the Hadoop Distributed File System (HDFS). Hadoop then scans through the data to produce results, which are finally written back out to files. Using the MapReduce concept, Hadoop splits a problem into sub-problems, sends the sub-problems to different machines, and lets each machine solve its sub-problem in parallel. It then combines the partial results and writes out the final solution to files for further processing.


Conclusion

Parallel processing is now widespread, and it is very important to harness its full potential. MapReduce is a promising model for multicore and cluster environments that helps fully utilize the power of such resources.

References:

http://en.wikipedia.org/wiki/MapReduce
http://userpages.uni-koblenz.de/~laemmel/MapReduce/paper.pdf
http://www.willowgarage.com/sites/default/files/ChuCT_etal_2006.pdf
http://infolab.stanford.edu/~ullman/pub/mapred.pdf
http://hadoop.apache.org/
http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//archive/mapreduce-osdi04.pdf
http://www.oracle.com/technetwork/database/hadoop-nosql-oracle-twp-398488.pdf


*If you find something is misleading or not correct then please throw some light on it.

Distributed Space-based architecture

Details
Published: Thursday, 09 January 2014 06:06

Application scalability has become one of the primary concerns for application designers and administrators. To add to the concern is the fact that existing tier-based business-critical applications are not able to achieve linear scalability due to the inherent limitations of existing architectures. What is needed is a software architecture pattern that allows for linear scalability of applications. Space based architecture (SBA) is one such pattern.

Javaspaces

SBA works by co-locating the data and business logic, thereby decreasing inter-tier messaging, which makes it an ideal architecture for building scalable and fault-tolerant applications. SBA is closely related to the Shared Nothing (SN) distributed computing architecture used by Google, Amazon and other well-known companies to address application scalability. It follows many of the principles of service-oriented architecture (SOA), Representational State Transfer (REST) and event-driven architecture (EDA), as well as certain elements of grid computing.

SBA is built on the “tuple space” paradigm, a technology well suited to implementing distributed caching in parallel computing.

  1. Tuple: A tuple is an ordered list of elements. Tuples have their own data type and can be identified using a key comprising one or more fields. Tuples are also referred to as “entry objects” in tuple space terminology.
  2. Space (Tuple Space): A space, or more formally a tuple space, is an implementation of the associative memory paradigm for parallel/distributed computing; it can be seen as a shared, distributed cache memory. A tuple space enables concurrent access to tuples. It is a paradigm for inter-process communication in which information tuples are shared, as opposed to the alternative message-passing paradigm. Tuple spaces have been implemented in many technologies and languages.
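As a minimal sketch of these two ideas (deliberately ignoring concurrency, blocking reads, and distribution, which real implementations such as JavaSpaces provide), a tuple space supports writing tuples and matching them against templates, where a wildcard field matches anything:

```python
class TupleSpace:
    """A minimal in-memory tuple space sketch (no concurrency or blocking)."""
    def __init__(self):
        self._tuples = []

    def write(self, tup):
        self._tuples.append(tup)

    def _matches(self, template, tup):
        # None in the template acts as a wildcard field.
        return len(template) == len(tup) and all(
            t is None or t == f for t, f in zip(template, tup))

    def read(self, template):
        # Non-destructive read: the tuple stays in the space.
        return next((t for t in self._tuples if self._matches(template, t)), None)

    def take(self, template):
        # Destructive read: the tuple is removed from the space.
        t = self.read(template)
        if t is not None:
            self._tuples.remove(t)
        return t

space = TupleSpace()
space.write(("temperature", "room1", 21.5))
print(space.read(("temperature", "room1", None)))  # ('temperature', 'room1', 21.5)
print(space.take(("temperature", None, None)))     # ('temperature', 'room1', 21.5)
print(space.read(("temperature", None, None)))     # None
```

The read/take distinction is what lets many processes coordinate through the space: readers observe shared state, while takers consume work items exactly once.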

Space-based architecture emphasizes developing applications as a collection of independent, loosely coupled services, minimizing the need to divide the logic into tiers.
The services thus created are the basic units of SBA; each communicates with data and logic co-located within the same process and is known as a “Processing Unit” (PU). A PU is an independent entity that can be scaled easily by adding more hardware resources and deploying it to the additional hardware without changing the existing setup.

SBA Implementations: A number of products implement the concepts of tuple space. A few significant ones are:
1.    GigaSpaces: GigaSpaces SBA combines and integrates parallel processing (a processing grid), distributed caching (a data grid), and content-based distributed messaging (a messaging grid) to provide a common platform for high availability, clustering, consistency services and location transparency across all tiers.
2.    JavaSpaces: JavaSpaces is a specification for implementing tuple spaces using Jini.

 

References:
•    Javaspaces, Oracle, http://java.sun.com/developer/technicalArticles/tools/JavaSpaces/
•    Space Based Architecture, Wikipedia, http://en.wikipedia.org/wiki/Space-based_architecture
•    Gigaspaces, http://www.gigaspaces.com/  
•    The Scalability Revolution-From Dead End to Open Road, Gigaspaces, http://www.gigaspaces.com/files/main/Presentations/ByCustomers/white_papers/FromDeadEndToOpenRoad.pdf
•    Migrating from JEE application to SBA, Gigaspaces, http://www.gigaspaces.com/files/main/Presentations/ByCustomers/white_papers/MigratingFromJEEtoSBA.pdf
•    Wikipedia http://en.wikipedia.org/wiki/Tuple_space


Connectome as a template for Network, Cognitive and Cloud systems

Details
Published: Thursday, 09 January 2014 06:02

 

I stumbled upon a paper published in the Proceedings of the National Academy of Sciences (PNAS) called “Network architecture of the long-distance pathways in the macaque brain”, with key implications for reverse-engineering the human brain and developing a comprehensive network of cognitive-computing chips. Such networks invite the kind of network-theoretic analysis that has been successful in understanding the Internet, social networking, and search on the world-wide web.

 

Don’t get me wrong: this is not a new field of study. Cognitive science has its roots in the 1950s. However, it’s only in the past 5 years that some significant research milestones have been reached. An understanding of the nervous system’s network could act as a template for architecting future networks, cognitive systems, and resource-allocation patterns in utility (cloud) computing models.

Key highlights of the PNAS research include findings on how information travels and is processed across the nervous system. The brain network does not scale freely like social networks, which are logical and can grow without constraint; this finding can inform the network-routing architecture of a cognitive-computing chip. By analysing the current state of the body, the brain also allocates priority to body functions, which can help in understanding and developing resource-allocation patterns.

 

Sebastian Seung: I am my connectome (video on TED.com)

Now let’s talk about the connectome. What is a connectome, you ask? Well, a connectome is a complete map of every neural connection in the brain. To learn more about the connectome, you can watch the video talk by Sebastian Seung.

I believe the connectome will open new avenues of study in advanced computing. The human nervous system is the most complex and advanced network on Earth, and probably in our galaxy (at least for now, and as far as we know). Just as the economy is interconnected and largely controlled by a small but powerful core network, so too is the human brain. This discovery aligns astonishingly well with nearly four decades of imaging studies that exhibit a ‘task-positive’ network implicated in goal-focused performance and a ‘task-negative’ network activated when the brain is idle and at rest.

Another reference for such research is the book Networks of the Brain by Olaf Sporns. In his book, Sporns explains how the integrative nature of brain function can be illuminated from a complex-network perspective, and introduces network theory to those working on theoretical network models. Sporns’s book unites function, dynamics, neural structure, and connectivity into a single, coherent framework.

The connectome goes a long way toward understanding the dynamic patterns of the brain that underlie behavior and cognition. With our current technology it will take at least 30-40 years (and I am being extremely optimistic) to completely map and produce a connectome of the entire human nervous system. The connectome will offer a synthesis of the science of complex networks that will be an essential foundation for future research in computer science.

References:
http://www.ibm.com/smarterplanet/us/en/business_analytics/article/cognitive_computing.html
http://hcp.600series.net/
http://www.kurzweilai.net/ibm-scientists-create-most-comprehensive-map-of-the-brains-network?utm_source=KurzweilAI+Daily+Newsletter&utm_campaign=2e47b007d7-UA-946742-1&utm_medium=email
http://www.dailygalaxy.com/my_weblog/2010/07/scientists-create-most-comprehensive-map-of-the-brains-network.html
http://www.modha.org/
http://humanconnectome.org/about/pressroom/tag/mapping-the-human-brain/
http://en.wikipedia.org/wiki/Connectome
http://www.ted.com/talks/sebastian_seung.html




Patterns for Cloud Computing

Details
Published: Thursday, 09 January 2014 05:58

 

Patterns are a commonly used concept in computer science to describe good solutions to recurring problems in an abstract form. Cloud patterns speak a higher-level language and address whole system components. There are a number of architecture and design patterns and best practices that help you select a cloud platform and implement cloud services and applications. In general, cloud patterns fall into four categories: storage, compute, management/administration, and communication.


1. Storage patterns

Cloud storage provides remote storage and abstracts the storage medium away from the users. The design is flexible enough to support a wide range of application requirements. Two common patterns of cloud storage are table storage and blob storage: the table storage pattern lets applications store key/value pairs following a table structure, while the blob storage pattern can be used to store any data.
A key idea of storing data in the cloud is to avoid worrying about DBA tasks. Most cloud vendors provide large-scale key/value stores as well as RDBMS services.
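A minimal in-memory sketch of the table storage pattern (the class and method names here are illustrative, not any vendor's API): rows are addressed by a partition key plus a row key, and each row carries its own property bag with no fixed schema.

```python
class TableStore:
    """Schemaless key/value table storage sketch: rows are addressed by
    (partition_key, row_key), each holding an arbitrary property dict."""
    def __init__(self):
        self._partitions = {}

    def put(self, partition_key, row_key, properties):
        self._partitions.setdefault(partition_key, {})[row_key] = dict(properties)

    def get(self, partition_key, row_key):
        return self._partitions.get(partition_key, {}).get(row_key)

    def query(self, partition_key):
        # Scans stay inside one partition, which is what makes scaling out easy:
        # different partitions can live on different storage nodes.
        return self._partitions.get(partition_key, {})

store = TableStore()
store.put("customers", "c1", {"name": "Ada", "plan": "pro"})
store.put("customers", "c2", {"name": "Bob"})   # no fixed schema per row
print(store.get("customers", "c1")["name"])     # Ada
print(len(store.query("customers")))            # 2
```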

• Structured Storage

Storing data in a table structure while not demanding full relational semantics makes it easy to set up, operate, and scale a relational database in the cloud. It provides cost-efficient and resizable capacity while managing time-consuming database administration tasks.
For example, Amazon RDS gives you access to the capabilities of a familiar MySQL or Oracle database. This means that the code, applications, and tools you already use today with your existing databases can be used with Amazon RDS.

• Unstructured Storage

This is the notion of storing large amounts of unstructured data on a highly durable storage infrastructure designed for mission-critical and primary data storage. Objects are redundantly stored on multiple devices across multiple facilities.
For example, Amazon S3 is designed to make web-scale computing easier for developers. Amazon S3 can be used to store and retrieve any amount of data, at any time, from anywhere on the web. It gives any developer access to the same highly scalable, reliable, secure, fast, inexpensive infrastructure.

• Flyweight

A flyweight is an object that minimizes memory use by sharing as much data as possible with other similar objects; it is a way to use objects in large numbers when a simple repeated representation would use an unacceptable amount of memory. Many workloads present opportunities for sharing memory across virtual machines.
For example, several virtual machines might be running instances of the same guest operating system, have the same applications or components loaded, or contain common data. VMware ESX/ESXi systems use a proprietary transparent page-sharing technique to securely eliminate redundant copies of memory pages.
A workload of many nearly identical virtual machines might free up more than thirty percent of memory, while a more diverse workload might result in savings of less than five percent of memory.
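This sharing effect can be illustrated with a toy calculation. This is a deliberate simplification: ESX's actual mechanism runs inside the hypervisor, hashes candidate pages, and verifies matches byte by byte before sharing them.

```python
import hashlib

def shared_memory_savings(vm_pages):
    """Estimate the fraction of memory saved by keeping a single copy
    of each identical page across a set of virtual machines."""
    total = sum(len(pages) for pages in vm_pages)
    # Hash each page's contents; identical pages collapse to one hash.
    unique = {hashlib.sha256(p).hexdigest() for pages in vm_pages for p in pages}
    return 1 - len(unique) / total

# Three VMs running the same guest OS share many identical pages,
# differing only in their application data.
os_pages = [b"kernel", b"libc", b"drivers"]
vms = [os_pages + [b"app-%d" % i] for i in range(3)]
print(round(shared_memory_savings(vms), 2))  # 0.5
```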


2. Compute patterns

•  Multi-tenancy

The key idea is to use the same set of resources to host the application for different customers. Consumers might utilize a public cloud provider’s service offerings, or might come from the same organization, such as different business units like finance and HR, but would still share infrastructure. From a provider’s viewpoint, multi-tenancy suggests an architectural and design approach to enable isolation, availability, economies of scale, segmentation, operational competence, and management, leveraging shared infrastructure, services, data, metadata, and applications across many different consumers.
To benefit from an elastic infrastructure by aligning resource counts to the experienced workload, the overall scaling process of the application has to be automated. You can then scale applications up and down to match unexpected demand without any human intervention. Auto-scaling promotes automation and drives more efficiency.
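One evaluation of a threshold-based auto-scaling rule can be sketched as follows; the thresholds, bounds, and function name are illustrative, not taken from any particular provider:

```python
def autoscale(instances, avg_cpu, min_instances=2, max_instances=10,
              scale_out_at=0.75, scale_in_at=0.25):
    """Return the new instance count after one evaluation of a threshold rule."""
    if avg_cpu > scale_out_at and instances < max_instances:
        return instances + 1   # add capacity under load
    if avg_cpu < scale_in_at and instances > min_instances:
        return instances - 1   # release idle capacity
    return instances           # within the comfort band: do nothing

print(autoscale(3, 0.90))  # 4  (scale out)
print(autoscale(3, 0.10))  # 2  (scale in)
print(autoscale(3, 0.50))  # 3  (no change)
```

Real auto-scalers add cool-down periods and step sizes so the system does not oscillate, but the decision loop is essentially this rule run on a timer.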

i. Passive listener

The passive listener model uses a synchronous communication pattern where the client pushes a request to the server and synchronously waits for the processing result. In the passive listener model, machine instances typically sit behind a load balancer.

ii. Active worker

The active worker model uses an asynchronous communication pattern where the client puts the request on a queue, which is periodically polled by the server. After queuing the request, the client can do other work and come back later to pick up the result.
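The active worker model can be sketched with a worker thread polling an in-process queue; the queue here is a stand-in for a real distributed queue service such as Amazon SQS, and the names are illustrative:

```python
import queue
import threading

requests = queue.Queue()
results = {}

def worker():
    # The server side: repeatedly pull requests off the queue and process them.
    while True:
        job_id, payload = requests.get()
        if job_id is None:        # sentinel value used to shut the worker down
            break
        results[job_id] = payload.upper()  # stand-in for real work
        requests.task_done()

t = threading.Thread(target=worker)
t.start()

# The client side: enqueue a request, do other work, come back for the result.
requests.put(("job-1", "hello"))
requests.join()             # in practice the client would poll for the result
print(results["job-1"])     # HELLO

requests.put((None, None))  # shut the worker down
t.join()
```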

• Big Data Processing

Businesses, researchers, data analysts, and developers can use cloud computing to process vast amounts of data easily and cost-effectively. Amazon Elastic MapReduce utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3). In a nutshell, the Elastic MapReduce service runs a hosted Hadoop instance on a master EC2 instance and can instantly provision other pre-configured EC2 instances (slave nodes) to distribute the MapReduce processing; these are all terminated once the MapReduce tasks finish running.


3. Administration / Management patterns

Administration patterns differentiate two core aspects of service management: service deployment and service-level-management. Deployment patterns organize service definition, configuration, and monitoring, while other patterns address service-level management and regular operational maintenance.

• Design for Operations

Making your application operations-ready can be a vital issue, and that can be addressed by providing health status and logging.

• Cloud Deployment

Deployment patterns organize service definition, configuration, monitoring and deploying applications with desired configurations such as scale-out and high-availability requirements. Start, stop, and suspend cloud apps.

• Cloud Broker

Provide a single point of contact and management for multiple cloud service providers and maximize the benefits of leveraging multiple external clouds.
Technically, a cloud broker is able to:

  • Work seamlessly with different cloud services providers on behalf of customers. It includes taking care of system provisioning, monitoring, billing, etc. In some sense, it’s like service aggregation.
  • Ideally, move workloads among the service providers. No longer are you locked in with a particular service provider.
  • Maximize performance/price ratio of cloud services by shuffling workloads among the providers.
  • Scale VMs beyond a single service provider that may not have enough resources. Who says the cloud is unlimited? In theory it is, but in reality every service provider has a limit that you just don’t normally hit.

The DMTF formed the Open Cloud Standards Incubator to assess the impacts of cloud computing on management and virtualization standards and to make recommendations for extensions to better align with the requirements of cloud environments.


4. Communication Patterns

These patterns address message exchange. A queue (or mailbox) service provides a mechanism for different machines to communicate asynchronously via message passing. Azure, for example, leverages Windows Communication Foundation (WCF) and REST APIs for web service communication; you must consider partial trust models and the stateless nature of the application when implementing communication patterns.
Amazon Simple Notification Service (Amazon SNS) is a web service that makes it easy to set up, operate, and send notifications from the cloud. It provides developers with a highly scalable, flexible, and cost-effective capability to publish messages from an application and immediately deliver them to subscribers or other applications.

•  Messaging

Share messages between applications in a scalable, reliable, and asynchronous way, for example sending instant messages, e-mail, or alerts about resource and billing information. An alarm comprises a trigger, of which there are two types:
Condition, or state, trigger: monitors the current condition or state. For example: a virtual machine’s current snapshot is above 2 GB in size; a host is using 90 percent of its total memory; a data store has been disconnected from all hosts.
Event trigger: monitors events. For example: the health of a host’s hardware has changed.
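The two trigger types can be sketched as simple predicates; this is a toy model, not any vendor's alarm API:

```python
def condition_trigger(metric, threshold):
    """State trigger: fires while the monitored value exceeds a threshold."""
    return lambda state: state.get(metric, 0) > threshold

def event_trigger(event_name):
    """Event trigger: fires when a matching event arrives."""
    return lambda event: event.get("type") == event_name

# A state trigger is re-evaluated against current metrics on every poll.
memory_alarm = condition_trigger("memory_used", 0.90)
print(memory_alarm({"memory_used": 0.95}))  # True
print(memory_alarm({"memory_used": 0.40}))  # False

# An event trigger reacts to discrete occurrences as they are delivered.
hw_alarm = event_trigger("hardware_health_changed")
print(hw_alarm({"type": "hardware_health_changed", "host": "esx-01"}))  # True
```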

 

Final Thoughts:

The thought process used to reach an architectural pattern is as important as the solution itself. Without it, you may easily get lost while applying an architectural pattern to a real-world problem.

 

References:

http://www.doublecloud.org/2010/10/cloud-architecture-patterns-vm-factory/

http://en.wikipedia.org/wiki/Architectural_pattern_%28computer_science%29

http://msdn.microsoft.com/en-us/magazine/dd727504.aspx

http://www.slideshare.net/AmazonWebServices/aws-architectingjvariafinal

http://architects.dzone.com/news/cloud-computing-patterns


Database scaling with NoSQL

Details
Published: Thursday, 09 January 2014 05:55

For the past few months I have read a lot of articles on NoSQL vs. RDBMS. It’s almost like the religious war between Mac and PC users. For almost half a century, the RDBMS (relational database) has been the dominant model for database management. But today, non-relational “NoSQL” databases are gaining mindshare as an alternative model for database management.

 

So the question is what is NoSQL and how is it different and/or better from the traditional RDBMS?
According to the definition from Wikipedia, “NoSQL is a movement promoting a loosely defined class of non-relational data stores that break with a long history of relational databases. These data stores may not require fixed table schemas, usually avoid join operations and typically scale horizontally. Academics and papers typically refer to these databases as structured storage.” To keep things simple, NoSQL stores data in a non-relational fashion.

 

An RDBMS requires that data be normalized so that it can provide quality results and prevent duplicates and orphan records. Normalizing the data requires creating more tables, which requires table joins, and thus more indexes and keys. The problem becomes more apparent with highly diverse datasets: perhaps a hundred or so tables, each with several volatile indexes. I/O becomes chaotic when the indexes of different tables are stored on different parts of an HDD or SSD and there are concurrent reads and writes. In the cloud, the storage presented to the user may be backed by different disks or different kinds of store, since cloud storage comes with abstraction. As databases grow into the terabytes or even petabytes, performance starts to fall off significantly.

 

NoSQL uses multi-dimensional data structures and groups relevant data closely to reduce the I/O time required to return query results. NoSQL also distributes the work across multiple locations (often deployed on a grid) so that many threads work independently and simultaneously. NoSQL uses the concept of maps, which group multiple index values, allowing a single map to handle a dynamic set of queries based on many attributes. NoSQL also allows for versioning of records: by time-stamping changes, new records are added to the database without the overhead that updates and deletes carry in an RDBMS.
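The versioning idea can be sketched with an append-only store, where an update writes a new time-stamped version instead of modifying data in place. This is a toy model with a counter standing in for real timestamps (systems such as HBase, for example, keep time-stamped versions of cells):

```python
import itertools

class VersionedStore:
    """Append-only store sketch: an update writes a new time-stamped
    version of a record rather than modifying the existing one in place."""
    def __init__(self):
        self._rows = {}                   # key -> list of (timestamp, value)
        self._clock = itertools.count(1)  # stand-in for real timestamps

    def put(self, key, value):
        # No updates or deletes: just append a newer version.
        self._rows.setdefault(key, []).append((next(self._clock), value))

    def get(self, key, at=None):
        versions = self._rows.get(key, [])
        if at is None:
            return versions[-1][1] if versions else None
        # Return the newest version visible at the given timestamp.
        visible = [v for ts, v in versions if ts <= at]
        return visible[-1] if visible else None

store = VersionedStore()
store.put("user:1", {"plan": "free"})
store.put("user:1", {"plan": "pro"})       # appends, no in-place update
print(store.get("user:1")["plan"])         # pro
print(store.get("user:1", at=1)["plan"])   # free
```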

 

It should be pointed out that the idea that an RDBMS is always slower than NoSQL is not true. Take the case of analytics and business intelligence: businesses mine information in corporate databases to improve their efficiency and competitiveness, and business intelligence (BI) is a key IT issue for everyone from SMBs to large companies. NoSQL databases offer few facilities for ad-hoc query and analysis; even a simple query requires significant programming expertise, and commonly used BI tools do not provide connectivity to NoSQL. Some respite is provided by the emergence of solutions such as Hive and Pig, which provide easier access to data held in Hadoop clusters and, perhaps eventually, other NoSQL databases.

 

For decades database administrators have relied on scale-up (buying bigger servers as database load increases) rather than scale-out (distributing the database across multiple hosts as load increases). However, as transaction rates and availability requirements increase, and as databases move into the cloud or onto virtualized environments, the economic advantages of scaling out on commodity hardware become irresistible. NoSQL databases are designed to expand transparently to take advantage of new nodes, and they’re usually designed with low-cost commodity hardware in mind.
To sum up, NoSQL databases are becoming an important part of the database environment and, when used appropriately, can provide significant performance benefits.

References:
http://stu.mp/2010/03/nosql-vs-rdbms-let-the-flames-begin.html
http://www.techrepublic.com/blog/10things/10-things-you-should-know-about-nosql-databases/1772
http://news.ycombinator.com/item?id=1221598
http://arifn.web.id/blog/2010/05/05/nosql-the-end-of-rdbms.html

