EnterpriseDB's latest play to become a leading enterprise database vendor has seen the company reveal the EDB Postgres platform. The open source vendor has chosen to embrace a wide ecosystem of vendors and technologies in order to show that open source is the way forward for enterprise databases. Linster said: "Instead of saying all data has to live in the same infrastructure, we understand that sometimes data is better off sitting elsewhere." In addition to the software connections that EDB is making, it has also developed a partner ecosystem with leading hardware vendors such as IBM, HPE, and Dell. New capabilities that have been added include cloud management support for private cloud use, particularly OpenStack. Linster said that the idea is to simplify management and deployment and to get better access to other vendors: "we don't live in isolation," he said.
Manufacturers are in a particularly good position to benefit from the supply chain visibility that the tools can provide throughout the product life cycle. And with the rapid advancement of the Internet of Things (IoT), which is partially enabled by new open-source platforms, visibility can extend beyond a company's four walls and into the field as customers use the products. As early adopters of robotics, process controls, supply chain optimization, test automation and other advanced applications of statistics, manufacturers long ago recognized the importance and benefits of data management and analysis. In fact, open-source technologies have drastically changed the economics of large-scale data storage and processing in these four ways. One is commodity and cloud-based clusters: where floor space is at a premium, cloud-based platforms provide a viable off-premises option for compute and storage. Another is schema-on-read, which is especially helpful for rapidly evolving data structures. The resulting insights allow for the rapid identification and remediation of defects, optimization or even reduction of testing routines, and the opportunity to tailor a product for a given customer, based on observed usage patterns.
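Schema-on-read means raw records are stored as-is and a structure is imposed only at query time. A minimal sketch in plain Python (the field names and records are hypothetical) shows why this suits rapidly evolving data: new fields appear and missing ones default to null without any migration.

```python
import json

# Raw records land as-is; no schema is enforced at write time
raw_lines = [
    '{"device": "press-01", "temp_c": 71.2}',
    '{"device": "press-02", "temp_c": 69.8, "vibration_hz": 12.4}',  # new field
    '{"device": "press-03"}',                                        # missing field
]

def read_with_schema(lines, schema):
    """Impose a schema at read time: pick the requested fields,
    filling gaps with None, and ignore fields the query doesn't ask for."""
    for line in lines:
        rec = json.loads(line)
        yield {field: rec.get(field) for field in schema}

rows = list(read_with_schema(raw_lines, ["device", "vibration_hz"]))
print(rows[1]["vibration_hz"])  # the evolved field is available where present
```

Because the schema lives in the query rather than the storage layer, adding the `vibration_hz` sensor reading required no change to previously stored records.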
Cray has always been associated with speed and power, and its latest computing beast, the Cray Urika-GX system, has been designed specifically for big data workloads. It also includes its own graph database engine, the Cray Graph Engine, which the company claims is ten to 100 times faster than current graph solutions running complex analytics operations. While the customer still deals with applications built on top of the platform, Cray will handle all of the big-picture stuff and work with the customer's IT department on the rest. While it's all well and good to say you'll take care of the software maintenance, it gets tricky when the customer is building stuff on top of the software you installed and the vendor is responsible for making sure it all works. Cray's Ryan Waite, senior vice president of products, insists that Cray has a long history of working closely with its customers and can handle whatever grey areas may arise. In other words, they have to compete, so the multi-million dollar price tags of yesteryear are long gone.
Cray Urika-GX: aims to combine the best of the GD and XA but in a smaller format. According to Cray's VP of business operations EMEA, Dominik Ulmer, this is not exactly new ground for Cray; the company has been in the analytics business for the last four years. So if you really want to have a competitive advantage based on data-driven decision making then, "you have to make them fast and at a high frequency and in as flexible a way as possible," he said. "That means doing high-level data analytics with standard tools like Hadoop and Spark, along with something that we had on our Urika-GD system - special, purpose-built hardware with graph analytics on top." He believes that this will help researchers go deeper in order to discover unknown patterns and new dependencies and relationships.
Hadoop, the free Java-based programming framework, is designed to support the processing of large data sets in a distributed computing environment that is typically built from commodity hardware. At its core, Hadoop consists of a storage part, the Hadoop Distributed File System (HDFS), and a processing part, MapReduce. Basically, Hadoop works by splitting large files into blocks which are then distributed across nodes in a cluster to be processed. The base framework is made up of Hadoop Common, which contains libraries and utilities for other Hadoop modules; HDFS, a distributed file system that stores data on commodity machines; YARN, which works as a resource management platform; and MapReduce, which is for large-scale data processing. The MapReduce and HDFS components of Hadoop were originally inspired by Google's papers on MapReduce and the Google File System, published in 2004 and 2003 respectively. Java is the most common language on the Hadoop framework, although there is some native code in C and command line utilities written as shell scripts.
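The split/distribute/process flow described above can be sketched in miniature. This is a toy illustration of the MapReduce model in plain Python, not Hadoop's actual Java API: the map phase emits key-value pairs from each input split, a shuffle groups them by key, and the reduce phase aggregates each group.

```python
from collections import defaultdict

def map_phase(splits):
    # Map: emit (word, 1) for every word in every input split;
    # on a real cluster each split runs on its own node
    for split in splits:
        for word in split.split():
            yield word, 1

def shuffle(pairs):
    # Shuffle: group intermediate values by key before reducing
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts for each word
    return {word: sum(counts) for word, counts in groups.items()}

splits = ["big data on hadoop", "hadoop splits big files"]
counts = reduce_phase(shuffle(map_phase(splits)))
print(counts["hadoop"])  # prints 2
```

The framework's job is everything this sketch glosses over: distributing the splits, running mappers and reducers in parallel, and moving intermediate data between nodes during the shuffle.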
A new Azure service that supports databases of up to 60TB, versus 250GB to 1TB for Azure SQL Database, but optimized for data warehousing, with massively parallel processing for queries but a more limited subset of T-SQL available. Row-Level Security restricts which rows a given user can query: for example, a salesperson might only be allowed to see their own sales records, despite other records existing in the same tables. Dynamic Data Masking is another new feature. A simple concept, but one that enables easy monitoring and troubleshooting of performance issues. Entire tables can be migrated to Azure, or if a table has a mix of current and historical data, you can create a function to determine whether a row will be migrated. The ScaleR library, developed by Revolution Analytics, includes algorithms for data import; sorting, merging, and splitting; statistical functions and cross tabulation; data visualization; modeling algorithms and decision trees.
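Dynamic Data Masking itself is configured in T-SQL, but the concept is easy to sketch: non-privileged users see obfuscated values while the stored data is unchanged. A rough Python illustration, loosely mimicking SQL Server's email masking behaviour (the exact masking rule shown is an assumption, not the feature's precise output):

```python
def mask_email(value: str) -> str:
    # Expose the first character and the domain; hide the rest
    local, _, domain = value.partition("@")
    return local[:1] + "****@" + domain

def mask_row(row: dict, masked_columns: set, user_has_unmask: bool) -> dict:
    # Privileged users (think: UNMASK permission) see real data;
    # everyone else gets masked values for the designated columns
    if user_has_unmask:
        return row
    return {col: (mask_email(val) if col in masked_columns else val)
            for col, val in row.items()}

row = {"name": "Ada", "email": "ada.lovelace@example.com"}
print(mask_row(row, {"email"}, user_has_unmask=False)["email"])  # prints a****@example.com
```

The key design point, as in the real feature, is that masking happens at read time per user: the underlying row is never altered, so analytics and backups still operate on the true data.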
While this data on the population is currently stored in siloed and disparate databases, connecting it could make it possible to automatically follow individuals' records across all of the Home Office's many directorates, from the two years' worth of car journeys logged in the ANPR data centre, to the passports database, the police databases, and many others. After laying off over a third of its old IT staff, the Home Office has recently been attempting to recruit Hadoop specialists to help it build and maintain this new "single platform", with a presentation and talk seemingly doing the rounds around the user circuit until the Home Office got spooked by The Register. The other speaker at the HUGUK meeting, the head of strategy and architecture, Simon Bond, recognised this and offered a slide suggesting the scale of those databases. The TPT's "crucial work", as a Home Office spokesperson described it, included taking greater direct control over the design, delivery and operation of technology systems; standardising, integrating and reusing solutions across services; and developing a broader supplier base, including niche expert suppliers. Such niche expert suppliers are likely to include San Jose-based Hortonworks. At that time Hortonworks was the only accredited Hadoop support company listed on the government's procurement platform G-Cloud, and so the contract was awarded to it without the tender for a proof-of-concept going public.
The financial services market has been undergoing a fundamental change as customers demand better service, visibility and ease of use. The bank has opted not to build core IT systems from scratch, instead opting to use commoditised banking software from FIS and layering integrations within middleware. Among the traditional high street retail banks such as Barclays, Lloyds, HSBC and Royal Bank of Scotland, there wouldn't appear to be a lack of interest in technology, perhaps just a slowness to deploy it widely. Peter Simon, head of information at Barclays, said: "to process across all our small business customers on a daily basis it's about six weeks' worth of processing data." "Their future success will depend upon external collaborations that improve business operations, particularly those that introduce systems to support new products and better customer service." Citi, for example, has invested in Visible Alpha, a start-up offering a platform for aggregating and interpreting stock analyst models and forecast data.
Although San Jose-based MapR tips its hat to MapReduce by name, the increasing obsolescence of Google's 2004 framework – and the public enthusiasm for Apache Spark as its successor – has provoked the company into developing its own enterprise-grade Apache Spark Distribution. This will include the complete Spark stack, the company says, alongside its own IP in what it terms the MapR Converged Data Platform, to offer customers speedy in-memory processing, speedier app development, and code reuse across those applications. MapR is also going to include its Spark Distribution in its plug-and-play "Quick Start Solution" Hadoop offerings, which first came out last year to flog pre-built templates, configuration, and installation help. Corresponding with El Reg, Jack Norris, MapR's new senior veep for data and applications, said: "There is a lot of excitement in the developer community around Spark." MapR is seeing more growth in its free on-demand training classes, which relate mainly to Spark, and Norris added: "Developers talk about the ease of development in Spark and say the streaming analytics options are very strong." Norris said that a "hybrid open source model that can combine architectural innovations while supporting industry standard APIs and supporting the full rich open source community is the best model for meeting customers' needs."
Microsoft today announced that it is making a serious commitment to the open source Apache Spark cluster computing framework. After dipping its toes into the Spark ecosystem last year, the company today launched a number of Spark-based services out of preview and announced that the on-premises version of R Server for Hadoop (which uses the increasingly popular open source R language for big data analytics and modeling) is now powered by Spark. In addition, Microsoft announced that R Server for HDInsight (essentially the cloud-based version of R Server) is coming out of preview later this summer, and Spark for Azure HDInsight is now generally available with support for managed Spark services from Hortonworks. Power BI, Microsoft's suite of business intelligence tools, will now also support Spark Streaming to allow users to push real-time data from Spark right into Power BI. All of these announcements mark what Microsoft calls an extensive commitment for Spark to power Microsoft's big data and analytics offerings. Microsoft, as well as Google, Baidu, Amazon, Databricks and others, will feature prominently at the Spark Summit in San Francisco this week.
IBM reckons the rigs assembled to run the likes of Hadoop and Apache Spark are really just supercomputers in disguise, so it has tweaked some of its supercomputer management code to handle applications that sprawl across x86 fleets. As explained to The Register by IBM's veep for software-defined infrastructure, Bernie Sprang, apps resting on clusters need to optimise workloads across pools of compute and storage resources, and can benefit from templates that make it easier to deploy without dedicated hardware. That second point, Sprang says, is important because he's starting to see "cluster creep", a phenomenon whereby different teams inside an organisation each cook up their own compute clusters that could perhaps be shared instead of hoarded. There's also the Spectrum LSF tool for workload scheduling. IBM does think the Spectrum range is a fine idea for those contemplating cloudy or hybrid cloud analytics rigs, as it will happily span on-premises and public clouds. But that's where the similarities end: this lot is aimed squarely at clustered apps, and Big Blue hopes its high-end pedigree will interest those now wrestling with hyperscale workloads.
"I can't help but have some biases from my perspective, but I do my best not to be a pitch man, but rather to think about this broader community, and I think Cloudera recognises that it's in Cloudera's interest that we have a vibrant, diverse community of vendors." There is, of course, also the ODP's interest in Ambari – an open source management platform which is also a direct competitor to Cloudera's proprietary offering in Cloudera Manager, which Cutting unsurprisingly reckons is "considerably more advanced". "It isn't their data, it's rather something that helps them manage their data and manage their services, their open-source software stack."

Hadoops like yellow elephants

Though Hadoop was invented by Cutting, one of Cloudera's largest rivals, Hortonworks, claims to contribute more to Apache Hadoop than its competitors and sells itself on this front. "If you look at it ecosystem-wide, you know, Hortonworks and Cloudera are the largest contributors and you can find different metrics to make each shine depending on whether you look at lines of code, or numbers of bugs fixed, or whether you look at twenty projects wide, or three projects wide, and which twenty and which three; you can cook the books one way or another." "Our open source strategy has not changed much; not many of these things I'm talking about have changed since 2009, when we set on this path, and that was several years before Hortonworks was founded," he said.
Doug Cutting: "If you could have a petabyte of data in memory, accessible from any node within cycles, that's several orders of magnitude performance improvement." When Doug Cutting created the Hadoop framework 10 years ago, he never expected it to bring massive-scale computing to the corporate world. While XPoint will initially be offered as storage in the form of Optane-branded SSDs, Intel is planning to follow that up by releasing XPoint memory modules. Regardless, Cutting predicts the use of XPoint and other non-volatile memory in Hadoop clusters will open up the platform to new uses, allowing users to process much larger datasets in memory, which in turn will bypass the latency inherent in fetching data from disk. However, there are still limitations that Cutting says need to be addressed to make the process easier, with Cloudera planning to improve support for feeding data from AWS S3 and other cloud-based block storage to Hadoop's data processing engines. Looking further into the future of distributed systems, Cutting says an architecture is needed that can instantaneously consult both real-time and historical data to help make real-time decisions.
The Hadoop distribution war comes down to a final battle between Cloudera's CDH and Hortonworks' HDP. Indeed, it's competition that leads to end users getting the best possible products in their hands. In June of last year Derek Wood, a DevOps engineer at Cask, wrote a blog post showing which versions of various software packages were supported by which versions of HDP and CDH. Over the course of the past year I've become increasingly concerned that the Apache Spark ecosystem will go the way of Hadoop before it. Although Apache Spark is just four years old, we're already at the point where a few vendors are looking to sell Apache Spark to customers in different formats. Open-source ideals are frequently sacrificed on the altar of creating an easily packaged product that can be sold to generate short-term profits.
Find out a little bit more about this open source big data tool. The popular open source big data processing framework Apache Spark has become one of the most talked about pieces of technology in recent years. The popularity of the framework, which is designed around speed and ease of use, has seen the likes of IBM, Microsoft, and others align their own analytics portfolios around the technology. Built on top of Hadoop MapReduce, it extends this model in order to support more types of computation, including interactive queries and stream processing. In a standalone deployment, Spark sits on top of the Hadoop Distributed File System (HDFS), with space allocated explicitly for HDFS. Apache Spark can be downloaded from the Apache Software Foundation site, which lists numerous Spark releases and package types so that users can find the right version for their purposes.
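Part of the "speed and ease of use" story is Spark's core abstraction: transformations such as map and filter are recorded lazily and only executed when an action is called. A toy pure-Python analogue (not Spark's actual API, though the method names mirror PySpark's RDD) illustrates the pattern:

```python
class MiniRDD:
    """A toy stand-in for Spark's RDD: transformations are recorded
    lazily and only evaluated when an action (collect/count) runs."""
    def __init__(self, data, ops=None):
        self._data = data
        self._ops = ops or []

    def map(self, fn):
        # Transformation: returns a new dataset description, computes nothing
        return MiniRDD(self._data, self._ops + [("map", fn)])

    def filter(self, pred):
        return MiniRDD(self._data, self._ops + [("filter", pred)])

    def collect(self):
        # Action: only now does the recorded pipeline actually run
        out = iter(self._data)
        for kind, fn in self._ops:
            out = map(fn, out) if kind == "map" else filter(fn, out)
        return list(out)

    def count(self):
        return len(self.collect())

rdd = MiniRDD(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(rdd.collect())  # prints [0, 4, 16, 36, 64]
```

Deferring execution this way is what lets the real engine plan a whole chain of transformations at once and keep intermediate results in memory instead of writing them to disk between steps, as MapReduce does.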
Big data's impact on businesses has resulted in the creation of new C-level roles and posed the challenge of developing skills to keep up with the demand for data analytics and the tools that have swamped the market. A focus on data strategy is a sensible place for many of the tech vendors in the Hadoop ecosystem to go; after all, making money from consultancy-style services would add another source of revenue on top of the software, training and certification revenues they already receive. While all to some extent had their own product upgrades or new releases recently available, the theme of the conversations was predominantly around strategy. Clarke Patterson, senior director of product marketing at Cloudera, told CBR that over the next several months the company would be talking more about a "journey to success" by helping businesses understand the people, process, and technology. Greg Hanson, VP business operations, EMEA, at Informatica, told CBR: "One of the things we've always said is that in order to be successful in any project you need to have a combination of people, process and technology; obviously we have all the technology but it's also about the people." "I think for the first time what we are seeing is that these roles exist at the C-level and that's important because now they are starting to exist at a board level they can really start being the agent of change and make that move in organisations to data 3.0," said Hanson.
"If you have a traditional database like Oracle or MySQL, it's scale-up, and there's always the notion of a durable log," said Tarun Thakur, Datos IO's co-founder and CEO. "There is no concept of a durable log [in distributed databases] because there is no master -- each node is working on its own stuff," Thakur explained. Specifically, to offer scalability while accommodating the crazy amounts of diverse data flying at us at ever-more-alarming speeds, today's distributed databases have departed from the "ACID" criteria generally promised by traditional relational databases. Earlier this month, Datos IO launched RecoverX to address those concerns through features including what it calls scalable versioning and semantic deduplication. Souvik Das, who until recently was CTO and managing vice president of engineering with CapitalOne Auto Finance, has felt the backup crunch first-hand. After years of using traditional databases, CapitalOne underwent a "massive transformation" a few years back that included rolling out new distributed technologies such as Cassandra, said Das, who is now senior vice president of engineering at healthcare-focused startup Grand Rounds.
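The durable log Thakur describes is easiest to see in a write-ahead-log sketch: every update is appended to a persistent log before the in-memory state changes, so a restart can rebuild state by replaying the log. A minimal Python illustration (not Datos IO's or any particular database's actual implementation):

```python
import json
import os
import tempfile

class WALStore:
    """Minimal write-ahead-log sketch: every update is made durable
    in an append-only log before being applied to in-memory state."""
    def __init__(self, log_path):
        self.log_path = log_path
        self.state = {}
        self._replay()

    def _replay(self):
        # Recovery: re-apply every logged record in order
        if os.path.exists(self.log_path):
            with open(self.log_path) as f:
                for line in f:
                    rec = json.loads(line)
                    self.state[rec["key"]] = rec["value"]

    def put(self, key, value):
        # Log first (and force it to disk), only then mutate state
        with open(self.log_path, "a") as f:
            f.write(json.dumps({"key": key, "value": value}) + "\n")
            f.flush()
            os.fsync(f.fileno())
        self.state[key] = value

path = os.path.join(tempfile.mkdtemp(), "wal.log")
db = WALStore(path)
db.put("balance", 100)
db.put("balance", 80)
recovered = WALStore(path)          # simulate a crash and restart
print(recovered.state["balance"])   # prints 80: state rebuilt from the log
```

With a single master, this one ordered log is the recovery point for the whole database; in a masterless system like Cassandra each node logs independently, which is exactly why getting a consistent cluster-wide backup becomes hard.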
It's all about 'composable' infrastructure. Backgrounder: DriveScale is a startup that emerged from a three-year stealth effort earlier this year with hardware and software to dynamically present externally connected JBODs to servers as if they were local. It is meant to share the characteristics of hyper-scale data centres without involving that degree of scale and DIY activity by enterprises. Servers and storage should be managed as separate resource pools.

Founding, founders, and funding

DriveScale received seed funding of $3m when it was founded in 2013 by chief scientist Tom Lyon and CTO Satya Nishtala. They were founders of the Nuova spinout which developed Cisco's UCS server technology. At Sun, Lyon worked on Sparc processor design and SunOS, while Nishtala was involved with Sun storage, UltraSparc workgroup servers and workstation products.
Promo: If keeping up with the volume, variety and velocity of information is a data scientist's biggest challenge, keeping on top of the tools and methods to do this must come fairly close. So if you're struggling to keep up, or simply want to reality-check your current setup, you'll be pleased to know that IBM Cloud Data Services brings together a variety of tools and resources to help you do just that. If you want to dive right in, you can begin trials of IBM Analytics for Apache Spark, with integrated Jupyter Notebooks. This is a managed service, so you can get down to business even quicker. Of course, the data has to be usable, so you can also try out IBM DataWorks, Big Blue's managed self-service data prep and movement service. And looking into the future, you can pre-register for the upcoming IBM BigInsights on Cloud, which will allow you to spin up Apache Hadoop clusters in minutes.
Fear not: the lingo needn't be mysterious. Take one major trend spanning the business and technology worlds, add countless vendors and consultants hoping to cash in, and what do you get? In the world of big data, the surrounding hype has spawned a brand-new lingo. One such term is "fast data": it refers to "data whose utility is going to decline over time," said Tony Baer, a principal analyst at Ovum who says he coined the term back in 2012. It's things like Twitter feeds and streaming data that need to be captured and analyzed in real time, enabling immediate decisions and responses. "Fast data can refer to a few things: fast ingest, fast streaming, fast preparation, fast analytics, fast user response," said Nik Rouda, a senior analyst with Enterprise Strategy Group.
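Data whose utility declines over time is typically processed in time-bounded windows, so that analytics always run on fresh events and stale ones age out. A small, hypothetical Python sketch of a sliding-window average over a stream:

```python
from collections import deque

class SlidingWindow:
    """Keep only events from the last `span_seconds`, so analytics
    run on fresh data while stale events age out of the window."""
    def __init__(self, span_seconds):
        self.span = span_seconds
        self.events = deque()  # (timestamp, value) pairs, oldest first

    def add(self, timestamp, value):
        self.events.append((timestamp, value))
        self._evict(timestamp)

    def _evict(self, now):
        # Drop events older than the window span
        while self.events and now - self.events[0][0] > self.span:
            self.events.popleft()

    def mean(self):
        vals = [v for _, v in self.events]
        return sum(vals) / len(vals) if vals else 0.0

w = SlidingWindow(span_seconds=60)
w.add(0, 10.0)     # arrives at t=0
w.add(30, 20.0)    # arrives at t=30
w.add(90, 30.0)    # by t=90 the t=0 event has aged out
print(w.mean())    # prints 25.0, the average of the two surviving events
```

Real streaming engines apply the same idea at scale, which is what makes "fast ingest" and "fast analytics" two sides of the same requirement: the answer is only valuable while the data in the window still is.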