Linear scalability, the bottleneck of SQL databases

Introduction

Companies today store far more data than they did only a few years ago, and the volume of data they generate keeps growing exponentially; by 2025, an estimated 463 exabytes of data will be generated worldwide every day. The best-known technology for storing and processing data is the database. However, traditional databases cannot manage such huge amounts of data. An alternative exists, called NoSQL, but it comes with several problems:

  • The interfaces of the NoSQL platforms are different from traditional SQL databases, which implies a significant shift for a company’s product.

  • NoSQL databases are typically hard to configure for large deployments.

  • Most NoSQL databases do not guarantee ACID transactions, which impacts how these databases are used by applications.

  • The few that do guarantee ACID transactions have scalability problems. In many cases, the system might not be capable of growing as much as the workload requires.

LeanXcale is a database that solves these problems without impacting current systems:

  • LeanXcale offers an SQL interface that follows the query language standard to make it easier to adapt to the new database system. Companies won’t need to spend much effort refactoring products to adapt to this scalable database.

  • LeanXcale is easy to configure: the on-premise instance comes entirely configured and ready for use out of the box.

  • LeanXcale guarantees full ACID transactions, so you won’t need to adapt your product.

  • Based on its patented system, LeanXcale can scale as much as the workload requires while maintaining full ACID transactions, so companies can grow as much as needed.

In this article, I test the horizontal scalability of the LeanXcale database using the TPC-C benchmark.

LeanXcale characteristics

LeanXcale comprises components with distributed capabilities. All are designed and implemented to be distributed and highly available, allowing LeanXcale to run on a single instance or on many. These components are distinguished as Meta or Data node instance types.

post1.png
post2.png

For the scalability benchmark applied here, the Data node is most relevant because it allows us to distribute the storage engine across multiple instances. Both the KiVi Data Store (KVDS) and the Query Engine components can run with their load distributed across multiple instances. KiVi, the LeanXcale storage engine, can be fully distributed, as every table can be split and spread across many KVDS instances. Consequently, the storage engine can run on many nodes without any tradeoffs. At the same time, the Query Engine can run as many instances as the workload requires while maintaining full ACID transactions, and it can coordinate the work of large queries by using processing units on each Query Engine instance. With this, the Query Engine can run HTAP workloads while keeping full ACID transactions. For this demonstration, we run the TPC-C benchmark, which exercises OLTP (Online Transaction Processing) transactions.

As described above, LeanXcale is a distributed database, and it offers disruptive features. Scalability is the most relevant one for this benchmark, although we also use LeanXcale's dual interface: a key-value interface offering fast insertions, gets, and scans, and an SQL interface exposed through a JDBC driver. The key-value interface is used to populate the database and prepare all the data required to run the benchmark clients, while the benchmark client runs its SQL queries through the JDBC interface.

You can get a more in-depth description of LeanXcale capabilities here.

The TPC-C Benchmark

The TPC-C benchmark is an OLTP benchmark proposed by the Transaction Processing Performance Council to emulate sales in a large company represented by different warehouses. The number of warehouses has a direct relation to the number of clients querying the system (ten clients per warehouse) and to the size of the database (approximately 500k tuples per warehouse). The benchmark can therefore run with different data and transaction volumes by increasing the number of warehouses and clients. The performance metric reported by TPC-C measures the number of orders that can be fully processed per minute, expressed in tpmCs.

Figure 1. Illustration of the number of rows per table.


Figure 1 includes a diagram with the dependencies between the tables used by the TPC-C benchmark. All the tables have a fixed size determined by the number of warehouses, except for Order line, History, and Order. These three tables have a set size when the benchmark starts, and then grow as the benchmark progresses.

This benchmark simulates different types of user transactions, each running a set of related queries, mixed according to the following percentages:

 
post4.PNG
 

To test the scalability of LeanXcale, we run deployments with different numbers of data nodes, from 1 to 100. The meta node and all data nodes used for this demo are homogeneous r5d.xlarge AWS instances, and the TPC-C clients are started on t2.xlarge AWS instances.

On each data node, we populate the benchmark with 200 warehouses, then run the benchmark with 1, 20, and 100 data nodes. Each data node includes a Query Engine responsible for managing the transactional workload for a fraction of the clients. The data nodes also have an instance of the storage engine, KVDS. Each KVDS instance oversees a fraction (200 warehouses) of the entire dataset.

The result of this benchmark is the number of new-order operations (around 45% of all operations in the benchmark), expressed in tpmCs. The benchmark passes if the result is around 12.5 tpmCs per warehouse.
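
As a quick sanity check, the sizing rules above (ten clients and roughly 500k tuples per warehouse) together with the approximate 12.5 tpmCs-per-warehouse pass threshold let us estimate what each configuration should deliver. A minimal sketch of that arithmetic (the constants are the approximations quoted in this post, not exact TPC-C rules):

public class TpccSizing {
  // Approximate figures quoted above
  static final int CLIENTS_PER_WAREHOUSE = 10;
  static final long TUPLES_PER_WAREHOUSE = 500_000L;
  static final double TPMC_PER_WAREHOUSE = 12.5;

  public static void main(String[] args) {
    for (int warehouses : new int[]{200, 4_000, 20_000}) {
      System.out.printf("%,d warehouses -> %,d clients, ~%,d tuples, ~%,.0f tpmCs expected%n",
          warehouses,
          warehouses * CLIENTS_PER_WAREHOUSE,
          warehouses * TUPLES_PER_WAREHOUSE,
          warehouses * TPMC_PER_WAREHOUSE);
    }
  }
}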

The following chart shows the tpmCs obtained with the following configurations:

  • 1 data node with 200 warehouses: 2,500 tpmCs
  • 20 data nodes with 4,000 warehouses: 50,000 tpmCs
  • 100 data nodes with 20,000 warehouses: 250,000 tpmCs
post5.PNG

With this experiment, we confirmed that LeanXcale has scale-out properties by showing that, with 100 data nodes, its performance is 100 times higher than with a single data node. This makes LeanXcale capable of processing any transactional workload by dimensioning the system accordingly.

Response time is as relevant for the TPC-C benchmark as the number of transactions the system processes. With LeanXcale, the average response time (measured over 15 minutes) is independent of the number of warehouses and data nodes in the deployment. The following chart shows the response time during the last 15 minutes of the TPC-C benchmark. As can be seen, the response time stays between a minimum of 45 ms and a maximum of 80 ms.

post6.PNG

Now that we have demonstrated the linear scalability of LeanXcale, we will devote future posts to testing other LeanXcale features, such as real-time capabilities or geohash support, but in the meantime you can get more info about LeanXcale from their whitepapers here.

WRITTEN BY:

DBS.jpg

Diego Burgos Sancho, Software Engineer at LeanXcale

After working at the Distributed Systems Lab at the Technical University of Madrid as a computer scientist, I joined LeanXcale to build one of the most interesting distributed systems ever developed. At LeanXcale, I’m performing research through an industrial PhD at the Technical University of Madrid.

diego.burgos@leanxcale.com

https://www.linkedin.com/in/diegobursan/

GIS support on LeanXcale

Geospatial data refers to objects (in the broad sense of the word) that may have a representation on the surface of the earth. These representations can be locations (points), paths, areas, or any other kind of information that can be represented on a map. As these objects may be static in the short term or dynamic in real time, geospatial data combines objects with metadata, such as specific attributes or temporal information, to build a complex environment ready to be analyzed. In this context, the concept of geospatial big data arises. Today, because of databases’ limitations (among other reasons), it is difficult to fully exploit a significant volume of geospatial data in many applications.

LeanXcale offers support for geospatial data through specifically created ST functions that handle geospatial data using SQL. Combined with LeanXcale's elimination of transactional bottlenecks and its linear scaling of instances, we now have the capability to store and load geospatial data to and from LeanXcale in a single instance, eliminating the need for ETLs and data transport between instances.

To illustrate this new GIS support feature, we have developed a demo, which I would like to share in this post. It is a simple demo in which we store geospatial data (specifically, geospatial points) in LeanXcale and visualize it in real time over a map.

Currently, we support Geometry and Geohash functions, as described in the wikis. We use WKT (Well-Known Text), a standardized text encoding, to store geospatial vector geometry objects.
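
For example, a single GPS position ends up stored as a WKT string such as POINT (37.3940 -122.0800); the coordinates shown here are purely illustrative.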

COMPONENTS

We need:

  • A LeanXcale (LX) database instance. You can find it here.
  • A business intelligence tool to visualize geospatial data.
  • A nice dataset.
  • A program that simulates geospatial data insertion on LX.

LeanXcale will soon be released with a fully integrated BI tool based on Apache Superset. LeanXcale will thus offer an integrated SQL client together with a very complete set of visualizations that a non-expert user with some SQL knowledge can easily create. So, the chosen BI tool to visualize the data is Apache Superset.

In terms of obtaining a nice dataset, after some research we found this research group with access to some datasets handling geospatial data. Mobile Century was an experiment conducted at UC Berkeley to test the potential of GPS data to estimate traffic conditions. The dataset contains 8 hours of GPS trajectory data from 100 vehicles on a ~10-mile stretch of I-880 in California, as well as inductive loop detector data from PeMS and travel times recorded by license plate recognition. What we are going to simulate is the real-time movement of a subset of vehicles (20 vehicles) according to this dataset.

In terms of a program that simulates geospatial data insertion, we chose Java with Spring Boot to load the CSV data from the Mobile Century dataset. A state machine has been implemented with the states running, stopped, and to-clear, and it replies to HTTP GET calls to run, stop, or restart the demo.
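
As a rough idea of how those HTTP controls can be wired in Spring Boot, here is a minimal sketch (the interface, class, and method names are assumptions for illustration; the actual code is in the repository linked below, and the /GISDemo prefix may equally come from the servlet context path):

import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

// Hypothetical component holding the running/stopped/to-clear state machine.
interface LoaderStateMachine {
  void run();
  void stop();
  void clean();
}

@RestController
public class GISDemoController {

  private final LoaderStateMachine stateMachine;

  public GISDemoController(LoaderStateMachine stateMachine) {
    this.stateMachine = stateMachine;
  }

  @GetMapping("/GISDemo/run")
  public String run() {
    stateMachine.run();   // start the CSV reader threads
    return "running";
  }

  @GetMapping("/GISDemo/stop")
  public String stop() {
    stateMachine.stop();  // pause insertion
    return "stopped";
  }

  @GetMapping("/GISDemo/clean")
  public String clean() {
    stateMachine.clean(); // recreate the database structure
    return "cleaned";
  }
}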

ARCHITECTURE

So, let's go. We are going to benefit from LeanXcale's versatility, in which we can use either a NoSQL or an SQL interface. The data loader will use the LeanXcale-supplied library for NoSQL insertion, handling tuples instead of SQL queries. Conversely, the Superset BI tool will use SQL queries to extract the data to be visualized. So, the architecture follows this schema:

post_image_1.png

DATA LOADER

Let's do a little code design for GISDataLoader. The loader will create a thread pool in which each thread takes a vehicle-position CSV file, reads it, and inserts a new tuple at a fixed period. This way, we can simulate live data ingestion into LeanXcale.
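
A minimal sketch of that scheduling logic, assuming one scheduled task per CSV file and a configurable insertion period (names and details here are illustrative; the real loader is in the repository linked below):

import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: one scheduled task per vehicle CSV, each inserting one row
// per period to simulate a live GPS feed.
public class GISDataLoaderSketch {

  public static void simulate(List<Path> vehicleCsvFiles, long periodMillis) throws Exception {
    ScheduledExecutorService pool = Executors.newScheduledThreadPool(vehicleCsvFiles.size());
    for (Path csv : vehicleCsvFiles) {
      Iterator<String> rows = Files.readAllLines(csv).iterator();
      pool.scheduleAtFixedRate(() -> {
        if (rows.hasNext()) {
          // CSV format: unixtime;latitude;longitude;speed
          String[] rowParts = rows.next().split(";");
          insertTuple(csv.toString(), rowParts);
        }
      }, 0, periodMillis, TimeUnit.MILLISECONDS);
    }
  }

  private static void insertTuple(String path, String[] rowParts) {
    // Placeholder: the real loader converts the row into a KiVi tuple (see row2Tuple
    // below) and inserts it through the NoSQL API within a transaction.
  }
}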

As previously described in our streaming Twitter post, the KiVi library is not available from a public repository, so we have to install the LX all-dependencies jar in our local repository and define the dependency as follows:

mvn install:install-file -Dfile=kivi-api-0.96-SNAPSHOT-direct-client.jar -DgroupId=com.leanxcale -DartifactId=kivi-api -Dversion=0.96-SNAPSHOT -Dpackaging=jar

    <dependency>
      <groupId>com.leanxcale</groupId>
      <artifactId>kivi-api</artifactId>
      <version>0.96-SNAPSHOT</version>
    </dependency>

We also need Geotools for Java, so let’s add the necessary repos and dependencies:

<repository>
    <id>osgeo</id>
    <name>Open Source Geospatial Foundation Repository</name>
    <url>http://download.osgeo.org/webdav/geotools/</url>
</repository>
<repository>
    <snapshots>
        <enabled>true</enabled>
    </snapshots>
    <id>boundless</id>
    <name>Boundless Maven Repository</name>
    <url>http://repo.boundlessgeo.com/main</url>
</repository>
<dependency>
    <groupId>org.geotools</groupId>
    <artifactId>gt-shapefile</artifactId>
    <version>23-SNAPSHOT</version>
</dependency>
<dependency>
    <groupId>org.geotools</groupId>
    <artifactId>gt-swing</artifactId>
    <version>23-SNAPSHOT</version>
</dependency>

Code design, without going into extensive detail at this point, follows the next diagram. Note that the code is prepared for easily adding new dataset loaders by adding configuration to properties files and implementing the MetadataCreator interface (to create the database schema) and the DataLoaderInterface (to properly load the data). If we wished to store more CSV-based data, it would only be necessary to extend the abstract classes and implement their abstract methods.

You can find source code here.

post_image_2.png
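
As a rough idea of the two extension points mentioned above, a minimal sketch could look like this (only the methods shown later in this post are included; the actual interfaces in the repository may declare more, and each would live in its own file):

import com.leanxcale.kivi.session.Session;
import com.leanxcale.kivi.tuple.Tuple;

// Creates the database schema (table and sequence) needed by a concrete dataset loader.
public interface MetadataCreator {
  void createTable(Session session);
}

// Converts one CSV row into a KiVi tuple ready to be inserted.
public interface DataLoaderInterface {
  Tuple row2Tuple(String path, String[] rowParts, Session session);
}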

An important thing to note is the use of the NoSQL API. When implementing createTable, we have to make use of the KiVi library, in this way:

@Override
    public void createTable(Session session) {
        // Gets the database object (to access the schema)
        Database database = session.database();

        // Checks the existence of the table
        if (!database.tableExists(getTable())) {

            // If the table does not exist create it.
            database.createTable(getTable(), 
                    Collections.singletonList(new Field("ID", Type.LONG)), // PK
                    Arrays.asList(new Field("VEHICLE", Type.STRING), new Field("MOMENT", Type.TIMESTAMP), 
                            new Field("LOCATION", Type.STRING),new Field("SPEED", Type.DOUBLE)));
        }
    }

    @Override
    public void setTable() {
        // Body elided in this excerpt: sets the table name ("GPS") returned by getTable()
    }

    @Override
    public void setSequence() {
        // Body elided in this excerpt: sets the sequence name ("GPS_ID_SEQ") returned by getSequence()
    }

This code corresponds to the following SQL statement:

create table GPS (
  ID BIGINT NOT NULL,
  VEHICLE VARCHAR NOT NULL,
  MOMENT TIMESTAMP NOT NULL,
  LOCATION VARCHAR,
  SPEED DOUBLE,
  CONSTRAINT PK_GPS PRIMARY KEY (ID)
);

In addition, the GPS_ID_SEQ sequence is automatically created in the abstract parent class.

When implementing the row2Tuple method, in addition to making use of the GeoTools library, we have to again make use of the KiVi library, in this way:

@Override
    public Tuple row2Tuple(String path, String[] rowParts, Session session) {
        // CSV FORMAT: unixtime(long);latitude(double);longitude(double);speed(double)
        // Position in WKT
        GeometryFactory geometryFactory = new GeometryFactory();
        Coordinate coords = new Coordinate(Double.parseDouble(rowParts[1]),Double.parseDouble(rowParts[2]));
        Geometry point = geometryFactory.createPoint(coords);
        WKTWriter writer = new WKTWriter();
        String position = writer.write(point);
        // Get next sequence value
        Sequence idSeq = session.database().getSequence(getSequence());
        long nextSeqVal = idSeq.nextVal();
        // Create timestamp
        Timestamp t = new Timestamp(Long.parseLong(rowParts[0]));
        // Get vehicle id
        Pattern pattern = Pattern.compile("\\w+/\\w+/(\\w+).\\w+");
        Matcher matcher = pattern.matcher(path);
        String vehicleId = null;
        if (matcher.matches()) {
            vehicleId = matcher.group(1);
        }       
        // Generate tuple to insert
        Table table = session.database().getTable(getTable());
        Tuple tuple = generateTuple(table, nextSeqVal, vehicleId, t, position,
                Double.parseDouble(rowParts[3]));
        // Insert tuple
        System.out.println("Thread inserting record from csv " + path + " with id " + nextSeqVal);
        return tuple;
    }

The tuple format matches the previously created table structure.
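
The generateTuple helper is not shown above; a minimal sketch of what it does, assuming the same KiVi tuple API we use in our Twitter streaming post (the actual helper in the repository may differ), would be:

private Tuple generateTuple(Table table, long id, String vehicleId, Timestamp moment,
                            String location, double speed) {
    // createTuple() ensures the tuple has the table format
    Tuple tuple = table.createTuple();
    tuple.putLong("ID", id);
    tuple.putString("VEHICLE", vehicleId);
    tuple.putTimestamp("MOMENT", moment);
    tuple.putString("LOCATION", location); // WKT point, e.g. POINT (lat lon)
    tuple.putDouble("SPEED", speed);
    return tuple;
}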

SUPERSET BI TOOL

As previously indicated, new LeanXcale releases will be enriched through the addition of a packaged and ready-to-use Superset deployment. Meanwhile, Superset can be installed in a separate server by following these steps.

Next, we have to install the Python driver for LeanXcale. Superset runs with Python 3, so we have to install the driver using the pip3 command and executing the following:

pip3 install pyLeanxcale-0.12.0-py3-none-any.whl

The Python driver for LeanXcale can be downloaded from here.

The next step is to configure LeanXcale access from Superset. Navigate to “Sources/Databases”, click on the “+” button, and fill in the datasource name (local-lx-mobilecentury) and the SQLAlchemy URI (leanxcale://APP@:1529/mobilecentury). Mark all ticks except “Asynchronous Query Execution” and save the changes.

post_image_3.png

To configure the table, please navigate to “Sources/Tables”, click on the “+” button, and fill in the fields with the recently created database connection, the schema (APP), and the table name (GPS). Save the changes.

Let’s now configure the chart to visualize vehicle movements in real time. First of all, we have to define the SQL query that Superset is going to execute. We will use the Geometry functions provided by LeanXcale, and this is the query:

select vehicle, moment, ST_X(ST_PointFromText(location)) AS lat,
ST_Y(ST_PointFromText(location)) AS lon from gps where id in (select MAX(id) from gps group by vehicle)

We obtain the last inserted row for each vehicle using the auto-generated id field. Please note that this query makes use of the geospatial functions implemented by LeanXcale. Specifically, we build a Point from its WKT text before extracting the X and Y coordinates.

So, using Superset, let’s visualize the query results. Navigate to “SQL Lab/SQL Editor”. On the left part of the view, fill in the appropriate values (Database=local-lx-mobilecentury, Schema=APP, Table Schema=GPS). This will yield a data preview. Next, type the query, and click on “Run Query”:

post_image_4.png

To export query results and create a chart to visualize, click on the “Explore” button. The view to configure a chart will be displayed.

post_image_5.png

Click on the button under “Visualization Type” (it currently displays “Table”) and select “deck.gl Scatter plot”. This is a preconfigured visualization to draw positions on maps, which uses Mapbox. Note that it is necessary to request an API key for Mapbox and register it for Superset under the name “MAPBOX_API_KEY”. As it is an environment variable, you can set its value wherever you wish.

Once “deck.gl Scatterplot” is selected, fill in the fields in the left part of the view with the appropriate values. The important ones are latitude and longitude:

post_image_6.png

You can then play with the rest of the options to change visualization, map zoom, point radius, and colors, etc. In this demo, we are choosing the following options:

-Without auto zoom.

-Point size: 150, radius in meters.

-Point color: categorical, with Dimension Vehicle.

When you have finished, click on the “Run Query” button. Select appropriate zoom over the map and click on “Save”. We will save this chart with “2008 MOBILE CENTURY EXPERIMENT GPS TRAJECTORY DATA” as the name and also add it to a dashboard with the name “GISDemo”.

post_image_7.png

After clicking on “Save and go to dashboard”, we are redirected to the dashboard page, on which we can set the chart size, among other things. To change some other settings, it is necessary to save the dashboard first, so give it a name, and click on “Save” on the right corner.

Now that the dashboard has been saved, let’s set the dashboard refresh interval. Click on “Edit dashboard”, and then click on the arrow to the right of the “Switch to view mode” button and select “Edit dashboard metadata”. Note that there is a field to set the auto refresh interval. By choosing this option, however, we can only select from a set of values. So, let’s edit the metadata and configure the desired number of seconds to refresh.

post_image_8.png
post_image_9.png

Modify the refresh frequency to 2 seconds in the “JSON Metadata” field and save the changes.
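
In our case, the relevant entry in the JSON metadata looks roughly like this (in recent Superset versions the refresh interval is stored in seconds under a refresh_frequency key; check your version if the key differs):

{
  "refresh_frequency": 2
}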

We’re almost done! Now run the loader:

mvn spring-boot:run

or

mvn clean package
java -jar GISDemo-0.0.1-SNAPSHOT.jar

Next, with your favorite HTTP client, send a clean request for creating the database structure:

http://dataloader_server_ip:8080/GISDemo/clean

Next, send a run request:

http://dataloader_server_ip:8080/GISDemo/run

The run request will start the insertion, and, at the same time, we can visualize the movements on our dashboard.

post_image_10.gif

You can stop the demo whenever you want with the following request:

http://dataloader_server_ip:8080/GISDemo/stop

You can then recreate the database structure, using the clean request, whenever necessary.

 

Written by

SEG.jpg

Sandra Ebro Gómez, Software Engineer at LeanXcale

10 years working as a J2EE Architect at Hewlett-Packard and nearly 2 years working at EVO Finance and The Cocktail as a Solution Architect. Now part of the LeanXcale team, trying to find and develop the best use cases for highly scalable databases.

sandra@leanxcale.com

 https://www.linkedin.com/in/sandra-ebro-g%C3%B3mez-8b139960/

Tweet sentiment analysis on a LeanXcale database

In previous posts, we showed how to stream Twitter data into a LeanXcale database using a direct API. In today's post, I will demonstrate how to run analytical SQL queries on those tweets and then apply natural language processing (NLP) techniques to perform sentiment analysis on them. This post aims to provide a simple example of sentiment analysis using the LeanXcale database, and it is inspired by this blog post, where a similar analysis is done reading from a MySQL database.

All code is written at the end of the post and is available in our git repository (https://gitlab.com/leanxcale_public/sentimentanalysislx).

Prerequisites

To follow this post, you will need a LeanXcale database populated with tweets in the same way as in our previous post. You can access a LeanXcale instance from our website.

You will also need Python 3.7 and the LeanXcale SQLAlchemy driver, which can be downloaded from here. Also, pip for Python 3 must be installed to install the required packages.

sudo apt update 
sudo apt upgrade
sudo apt install python3.7 python3-pip

The requirements and the LeanXcale python driver can be installed using pip by running:

pip3 install file.whl
pip3 install -r requirements.txt

In addition, stop words and wordnet from NLTK must be downloaded:

python3
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
quit()

For the setup of a LeanXcale database and population with a Twitter stream, please go through our post on streaming Twitter data into a LeanXcale database. If you get the following error “ImportError: No module named _tkinter, please install the python-tk package” when running the code, you need to install the tkinter package: apt install python3-tk.

LeanXcale connection

To connect our Python code to the populated LeanXcale database, the previously installed SQLAlchemy driver must be imported. From this package, we’ll need create_engine, MetaData, and Table, as well as sessionmaker from sqlalchemy.orm.

from sqlalchemy import create_engine, MetaData, Table
from sqlalchemy.orm import sessionmaker

If we remember from our previous post, the Twitter stream was stored in a database called ‘twitterdb’ and a table called ‘TWEETS’. Additionally, to connect, we will need our username, which is ‘APP’. The connection string used by SQLAlchemy has the following format: 'leanxcale://APP@10.64.21.1:1529/twitterdb’, where the IP address and port are the endpoint where the LeanXcale query engine is running. You will have to change the IP to your current installation endpoint. With this connection string, an engine object from SQLAlchemy is created. Now, there are two options to connect to LeanXcale and load data into a pandas DataFrame.

Loading option 1

The first alternative is taking advantage of the SQLAlchemy Object Relational Mapper (ORM). Basically, using this ORM, we can associate predefined Python classes with database tables. The ORM’s access to the database is done through a session object bound to the engine already created. Then, we create a table object to load our LX table, binding it to our LX engine using a MetaData object. After that, we create a query for returning all rows and columns of our table. Finally, we use the pandas read_sql function to execute the query and build a DataFrame from the returned ResultSet. It is very important to close the connection opened by the engine to the LX database because we are not going to use that connection anymore.

def lx_to_df(user, database, table):
   engine = create_engine('leanxcale://' + user + '@10.64.21.1:1529/' + database)
   Session = sessionmaker(bind=engine)
   session = Session()
   meta = MetaData()
   meta.bind = engine
   tweetsTable = Table(table, meta, autoload=True)
   query_all = session.query(tweetsTable)
   df = pd.read_sql(query_all.statement, query_all.session.bind)
   print('df loaded from LX DB')
   engine.dispose()
   return df

Loading option 2

The second alternative directly uses pandas read_sql_query function, which only requires the SQLAlchemy engine bound to the LX database and the analytical SQL query to be performed. Notice that this is very flexible because it allows you to use SQL queries as you would use in any relational database.

def lx_to_df_2(user, database, SQL):
   engine = create_engine('leanxcale://' + user + '@10.64.21.1:1529/' + database)
   df = pd.read_sql_query(SQL, engine)
   engine.dispose()
   return df

Cleaning tweets

Up to this point, we have been able to connect to a LeanXcale database and retrieve all records from a table into a pandas DataFrame. Once the data is stored in a DataFrame, we can start applying NLP techniques to infer information from the tweets. In our case, we use the NLTK Python library to remove stop words and to perform tokenization and lemmatization. Some HTML code is also removed using regular expressions.

def clean_tweets(df):
   stopwords_en = stopwords.words('english')
   # ps = PorterStemmer()
   wordnet_lemmatizer = WordNetLemmatizer()
   df["clean_tweets"] = None
   df['len'] = None
   print('cleaning tweets')
   for i in range(0, len(df['TEXT'])):
       exclusion_list = ['[^a-zA-Z]', 'rt', 'http', 'co', 'RT']
       exclusions = '|'.join(exclusion_list)
       text = re.sub(exclusions, ' ', df['TEXT'][i])
       text = text.lower()
       words = text.split()
       words = [wordnet_lemmatizer.lemmatize(word) for word in words if not word in stopwords_en]
       # words = [ps.stem(word) for word in words]
       #df.loc('clean_tweets')[i] = ' '.join(words)
       df['clean_tweets'][i] = ' '.join(words)
   df['len'] = np.array([len(tweet) for tweet in df["clean_tweets"]])
   return df

Sentiment analysis

When it comes to calculating sentiment from tweets, TextBlob is a simple Python library that offers several NLP options. The sentiment property of the TextBlob object provides a decimal value from -1 to 1, reflecting the negative or positive sentiment, respectively. To simplify the classification, a hard decision is used to define the three sentiment types:

• Positive: If the polarity is greater than 0.

• Neutral: If the polarity is equal to 0.

• Negative: If the polarity is less than 0.

def sentiment(tweet):
   analysis = TextBlob(tweet)
   if analysis.sentiment.polarity > 0:
       return 1
   elif analysis.sentiment.polarity == 0:
       return 0
   else:
       return -1

The WordCloud library is also used to plot a cloud of words for all words in tweets grouped by sentiment (that is, one word cloud for every sentiment class).

def word_cloud(df):
   plt.subplots(figsize=(12, 10))
   wordcloud = WordCloud(
       background_color='white',
       width=1000,
       height=800).generate(" ".join(df['clean_tweets']))
   plt.imshow(wordcloud)
   plt.axis('off')

Main method

In the main method, we sequentially call the already-explained methods and construct the lists containing tweets according to the sentiment classification.

if __name__ == "__main__":
   df = lx_to_df('APP', 'twitterdb', 'TWEETS')
   # SQL = 'select * from TWEETS'
   # df = lx_to_df_2('APP', 'twitterdb', SQL)
   clean_tweets(df)
   df['Sentiment'] = np.array([sentiment(x) for x in df['clean_tweets']])
   pos_tweets = [tweet for index, tweet in enumerate(df["clean_tweets"]) if df["Sentiment"][index] > 0]
   neg_tweets = [tweet for index, tweet in enumerate(df["clean_tweets"]) if df["Sentiment"][index] < 0]
   neu_tweets = [tweet for index, tweet in enumerate(df["clean_tweets"]) if df["Sentiment"][index] == 0]

   print("percentage of positive tweets: {}%".format(100 * (len(pos_tweets) / float(len(df['clean_tweets'])))))
   print("percentage of negative tweets: {}%".format(100 * (len(neg_tweets) / float(len(df['clean_tweets'])))))
   print("percentage of neutral tweets: {}%".format(100 * (len(neu_tweets) / float(len(df['clean_tweets'])))))

   word_cloud(df[df.Sentiment == 1])
   word_cloud(df[df.Sentiment == -1])
   word_cloud(df[df.Sentiment == 0])
   plt.show()

Conclusion

This post has shown how to connect to a LeanXcale database with Python using the SQLAlchemy driver and how to perform analytical SQL queries. In addition, some SQLAlchemy ORM functionality has been introduced, which adds a new way of interacting with the database. On top of that, some NLP analysis has been conducted to classify tweets according to sentiment in a simple way, thanks to the direct integration of LeanXcale with pandas DataFrames. If you have any issues running the example or working with LeanXcale, you can contact me using the information below.

Full code

SentimentAnalysis/sentiment_analysis_basic.py

from sqlalchemy import create_engine, MetaData, Table
from sqlalchemy.orm import sessionmaker
import pandas as pd
import os
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer
import nltk
from wordcloud import WordCloud
import numpy as np
import matplotlib.pyplot as plt
from textblob import TextBlob


def lx_to_df(user, database, table):
   engine = create_engine('leanxcale://' + user + '@10.64.21.1:1529/' + database)
   Session = sessionmaker(bind=engine)
   session = Session()
   meta = MetaData()
   meta.bind = engine
   tweetsTable = Table(table, meta, autoload=True)
   query_all = session.query(tweetsTable)
   df = pd.read_sql(query_all.statement, query_all.session.bind)
   print('df loaded from LX DB')
   engine.dispose()
   return df

def lx_to_df_2(user, database, SQL):
   engine = create_engine('leanxcale://' + user + '@10.64.21.1:1529/' + database)
   df = pd.read_sql_query(SQL, engine)
   engine.dispose()
   return df

def clean_tweets(df):
   stopwords_en = stopwords.words('english')
   # ps = PorterStemmer()
   wordnet_lemmatizer = WordNetLemmatizer()
   df["clean_tweets"] = None
   df['len'] = None
   print('cleaning tweets')
   for i in range(0, len(df['TEXT'])):
       exclusion_list = ['[^a-zA-Z]', 'rt', 'http', 'co', 'RT']
       exclusions = '|'.join(exclusion_list)
       text = re.sub(exclusions, ' ', df['TEXT'][i])
       text = text.lower()
       words = text.split()
       words = [wordnet_lemmatizer.lemmatize(word) for word in words if not word in stopwords_en]
       # words = [ps.stem(word) for word in words]
       #df.loc('clean_tweets')[i] = ' '.join(words)
       df['clean_tweets'][i] = ' '.join(words)
   df['len'] = np.array([len(tweet) for tweet in df["clean_tweets"]])
   return df

def sentiment(tweet):
   analysis = TextBlob(tweet)
   if analysis.sentiment.polarity > 0:
       return 1
   elif analysis.sentiment.polarity == 0:
       return 0
   else:
       return -1

def word_cloud(df):
   plt.subplots(figsize=(12, 10))
   wordcloud = WordCloud(
       background_color='white',
       width=1000,
       height=800).generate(" ".join(df['clean_tweets']))
   plt.imshow(wordcloud)
   plt.axis('off')

if __name__ == "__main__":
   df = lx_to_df('APP', 'twitterdb', 'TWEETS')
   # SQL = 'select * from TWEETS'
   # df = lx_to_df_2('APP', 'twitterdb', SQL)
   clean_tweets(df)
   df['Sentiment'] = np.array([sentiment(x) for x in df['clean_tweets']])
   pos_tweets = [tweet for index, tweet in enumerate(df["clean_tweets"]) if df["Sentiment"][index] > 0]
   neg_tweets = [tweet for index, tweet in enumerate(df["clean_tweets"]) if df["Sentiment"][index] < 0]
   neu_tweets = [tweet for index, tweet in enumerate(df["clean_tweets"]) if df["Sentiment"][index] == 0]

   print("percentage of positive tweets: {}%".format(100 * (len(pos_tweets) / float(len(df['clean_tweets'])))))
   print("percentage of negative tweets: {}%".format(100 * (len(neg_tweets) / float(len(df['clean_tweets'])))))
   print("percentage of neutral tweets: {}%".format(100 * (len(neu_tweets) / float(len(df['clean_tweets'])))))

   word_cloud(df[df.Sentiment == 1])
   word_cloud(df[df.Sentiment == -1])
   word_cloud(df[df.Sentiment == 0])
   plt.show()

Written by

Streaming Twitter data into a LeanXcale database

Recently, I read a post about streaming Twitter data into a MySQL database to filter and store tweets for further analysis.

Today, I want to share how to do the same using a LeanXcale database instead, which provides an easy key-value interface designed to ingest significant amounts of data (such as a Twitter stream) as well as an SQL interface better suited to complex analysis.

All code is written at the end of the post and is available in our git repository (https://gitlab.com/leanxcale_public/twitter2lx).

Prerequisites

As described in the original post, we need to have a Twitter developer account and a database (in this case, LeanXcale). Several sites explain how to obtain developer access to Twitter, and you can access a LeanXcale instance from this website.

Because our project is developed with Java and Maven, we also need a library to connect to the Twitter stream, such as Twitter4j, and another library to connect to the LeanXcale key-value interface (called the KiVi interface), so we’ll use the KiVi client library, downloadable from the LeanXcale website.

 Configure project

To configure the project, we must define the maven dependencies for both libraries. For Twitter4j, we point to a public maven repository:

<dependency>
      <groupId>org.twitter4j</groupId>
      <artifactId>twitter4j-stream</artifactId>
      <version>4.0.6</version>
    </dependency>

As the KiVi library is not available from a public repository, we install the all-dependencies jar in our local repository and define the dependency as follows:

 mvn install:install-file -Dfile=kivi-api-0.96-SNAPSHOT-direct-client.jar -DgroupId=com.leanxcale -DartifactId=kivi-api -Dversion=0.96-SNAPSHOT -Dpackaging=jar
    <dependency>
      <groupId>com.leanxcale</groupId>
      <artifactId>kivi-api</artifactId>
      <version>0.96-SNAPSHOT</version>
    </dependency>

 Configure the Twitter connection

Twitter4j provides several ways to configure the credentials needed to connect to its API, and we use the properties file approach. We create a file called twitter4j.properties in our resources root folder with the following content:

oauth.consumerKey = // Your consumer key
oauth.consumerSecret = // Your consumer secret
oauth.accessToken = // Your access token
oauth.accessTokenSecret = //Your access token secret

This method is simple and allows you to change the credentials without touching your code, so it’s sufficient for our example.

Configure the LeanXcale connection

Configuring and opening a session to LeanXcale through its key-value interface is straightforward, and we only need to define the session settings to create a new Session object using the endpoint URL (replace localhost with your server address in the following code):

String endpoint = "kivi:zk://localhost:2181";
String databaseName = "twitterdb";
String databaseUser = "APP";
Settings settings = new Settings();
settings.credentials(new Credentials().setDatabase(databaseName).setUser(databaseUser));
Session session = SessionFactory.newSession(endpoint, settings);

Note that the database is called twitterdb and the user is defined as APP.

Create the table

We store the tweets in a table called TWEETS (not so original) that includes the following fields:

  • ID (BIGINT): The primary key.
  • USERNAME (VARCHAR): Tweet’s author username.
  • CREATED (TIMESTAMP): Timestamp when the tweet was created.
  • TEXT (VARCHAR): Content of the tweet.
  • RT (INTEGER): Number of retweets.
  • LAT (DOUBLE): Position latitude from where the tweet was created (if defined).
  • LON (DOUBLE): Position longitude from where the tweet was created (if defined).
  • PLACE (VARCHAR): Name of the place from where the tweet was created (if defined).

We could use the SQL interface to create the table that stores the retrieved tweets through a standard CREATE query. To make our example self-contained, we do this instead using the KiVi interface as follows:

private static Table getOrCreateTable(Session session){
    Database database = session.database();
    String tableName = "TWEETS";
    if(!database.tableExists(tableName)){
      return database.createTable(
          tableName,
          Arrays.asList(new Field("ID", Type.LONG)),
          Arrays.asList(
              new Field("USERNAME", Type.STRING),
              new Field("CREATED", Type.TIMESTAMP),
              new Field("TEXT", Type.STRING),
              new Field("RT", Type.INT),
              new Field("LAT", Type.DOUBLE),
              new Field("LON", Type.DOUBLE),
              new Field("PLACE", Type.STRING)
          )
      );
    }
    return database.getTable(tableName);
}

Convert the Tweet to a Tuple

Once we define the table structure to store the tweets, we next define a function for converting from the Twitter4j class representing a tweet (Status) to a table row represented by the Tuple class.

The complete reference of the Status object can be reviewed on the Twitter4j website, but for our needs, the code should be:

private static Tuple tweetToTuple(Status status, Table table){
    // This ensures that the tuple has the table format
    Tuple tuple = table.createTuple();
    tuple.putLong("ID", status.getId());
    tuple.putString("USERNAME", status.getUser().getName());
    tuple.putTimestamp("CREATED", new Timestamp(status.getCreatedAt().getTime()));
    tuple.putString("TEXT", status.getText());
    tuple.putInteger("RT", status.getRetweetCount());
    if(status.getGeoLocation() != null){
      tuple.putDouble("LAT", status.getGeoLocation().getLatitude());
      tuple.putDouble("LON", status.getGeoLocation().getLongitude());
    }
    if(status.getPlace() != null){
      tuple.putString("PLACE", status.getPlace().getFullName());
    }
    return tuple;
}

Tuple acts as a typed map for the tuple fields, so the only remarkable thing in this code is that Tuple works with java.sql.Timestamp (because we’ve defined the CREATED field as a TIMESTAMP column). Therefore, we must convert the createdAt attribute from java.util.Date to java.sql.Timestamp.

Because GeoLocation and Place are not mandatory attributes in the Status object, we must check if they are null.

 

Streaming from Twitter

Now that we have configured the Twitter and database connections, defined the database table, and created a conversion function to convert from tweet to table row, it’s time to open the stream to receive data.

This is accomplished through the TwitterStream class, which controls the stream, and the StatusListener interface, which receives the data events.

First, we must implement the listener class. To avoid implementing many methods that we won’t use for this example, we’re going to extend the StatusAdapter class and override only the onStatus() method, instead of implementing the complete StatusListener interface.

private static class LeanXcaleListener extends StatusAdapter {
     Table table;
    Session session;
     public LeanXcaleListener(Table table, Session session) {
      this.table = table;
      this.session = session;
    }
     @Override
    public void onStatus(Status status) {
         Tuple tuple = tweetToTuple(status, table);
        // Here we’ve to insert tuples  into the database
    }
  }

Second, we configure and create the TwitterStream object:

ConfigurationBuilder cb = new ConfigurationBuilder();
cb.setDebugEnabled(true);
 TwitterStream twitterStream = new TwitterStreamFactory(cb.build()).getInstance();
twitterStream.addListener(new LeanXcaleListener(table, session));

Then, we can start sampling and filtering the tweets:

twitterStream.sample();
twitterStream.filter(new FilterQuery("datascience").language("en"));

In this case, to avoid receiving posts about just any topic, we define a filter using the expression “datascience” and, with the data processing of the next post in mind, we will receive only tweets written in English.

In any case, the Twitter4j API documentation describes in detail how to define different filters and options.

Inserting in LeanXcale

As noted above, the method onStatus in the listener is defined as follows:

@Override
    public void onStatus(Status status) {
         Tuple tuple = tweetToTuple(status, table);
        // Here we’ve to insert tuples  into the database
    }

Tuples are inserted into the database through the table object as follows:

 table.insert(tuple);

We can alternatively use a more efficient version of the method that does not check the uniqueness of the key and overwrites an existing tuple if one already exists:

table.upsert(tuple);

So, the onStatus() code would look like:

@Override
    public void onStatus(Status status) {
         Tuple tuple = tweetToTuple(status, table);
        table.upsert(tuple);
    }

LeanXcale is a transactional database, so we need to begin, commit, or rollback transactions to store the data effectively. As a first approach, we could do something like:

@Override
    public void onStatus(Status status) {
       Tuple tuple = tweetToTuple(status, table);
       System.out.println("Inserting: " + tuple);
        try {
            session.beginTransaction();        
            table.upsert(tuple);
            session.commit();
      }
      catch (Exception e){
        e.printStackTrace();
        try {
          session.rollback();
        }
        catch(Exception e1){
          e1.printStackTrace();
        }
      }
    }

The above code creates a new transaction per tweet, which is not efficient. At the end of this post, we include a more efficient approach that groups the insertions of several tweets within a single transaction.

Closing connections

The TwitterStream object creates a new thread to receive the events from the unending stream, so to close the stream and the database connection cleanly, we add this shutdown hook to the JVM to close the connections when the process is killed (e.g., with Ctrl+C).

Runtime.getRuntime().addShutdownHook(
          new Thread(() -> {
            twitterStream.shutdown();
            session.close();
          }));

 Running the code

Because we implemented the example using a Java class (Twitter2Lx), the main method looks like:

public static void main(String[] args){
    String endpoint = "kivi:zk://localhost:2181";
    String databaseName = "twitterdb";
    String databaseUser = "APP";
    Settings settings = new Settings();
    settings.credentials(new Credentials()
        .setDatabase(databaseName)
        .setUser(databaseUser)
    );   
    try{
      Session session = SessionFactory.newSession(endpoint, settings);
      Table table = getOrCreateTable(session);
      ConfigurationBuilder cb = new ConfigurationBuilder();
      cb.setDebugEnabled(true);
      TwitterStream twitterStream = new TwitterStreamFactory(cb.build()).getInstance();
      twitterStream.addListener(new LeanXcaleListener(table, session));
      twitterStream.sample();
      twitterStream.filter(new FilterQuery("datascience"));
      Runtime.getRuntime().addShutdownHook(
          new Thread(() -> {
            twitterStream.shutdown();
            session.close();
          }));
    }
    catch (Exception e){
      e.printStackTrace();
    }
  }

To invoke it from the command line:

mvn clean install exec:java -Dexec.mainClass="Twitter2Lx"

Since we added a System.out.println() call to the onStatus() method, we can see the tweets being processed in real time:

Inserting: TupleImpl{ ID:LONG:1166750906664333313 USER:STRING:Babbel Digital CREATED:TIMESTAMP:2019-08-28 18:34:27.0 TEXT:STRING:RT @schmarzo: How can your #datascience team become more productive working as a team? Get results with these 5 essential lessons 🚀 👩 @Data… RT:INT:0 LAT:DOUBLE:null LON:DOUBLE:null PLACE:STRING:null }

Inserting: TupleImpl{ ID:LONG:1166750927249969152 USER:STRING:Christopher Burnette CREATED:TIMESTAMP:2019-08-28 18:34:32.0 TEXT:STRING:RT @ProgrammerBooks: Exploring Data Science ==> https://t.co/tXW2T1TcfI

 #python #javascript #angular #reactjs #vuejs #perl #ruby #Csharp #… RT:INT:0 LAT:DOUBLE:null LON:DOUBLE:null PLACE:STRING:null }

If you don’t see any activity, then you can modify the filter terms to receive more popular topics.

Checking data from SQL

Although we will dig deeper into the data analysis in future posts, for now, you can connect to your LeanXcale instance through the JDBC interface to view the stored tweets.

Follow the Squirrel SQL or DBeaver tutorial on the LeanXcale website to configure your client using the following URL, replacing localhost by your server address:

jdbc:leanxcale://localhost:1529/twitterdb

And execute some SQL queries like:

 select count(*) from TWEETS
 select username, count(*) from TWEETS group by username
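
The same check can also be done programmatically through the JDBC driver; a minimal sketch (assuming the LeanXcale JDBC jar is on the classpath and that the APP user needs no password, as in the rest of this example):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class CountTweets {
  public static void main(String[] args) throws Exception {
    // Replace localhost with your server address, as with the SQL clients above
    try (Connection conn = DriverManager.getConnection("jdbc:leanxcale://localhost:1529/twitterdb", "APP", "");
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery("select count(*) from TWEETS")) {
      if (rs.next()) {
        System.out.println("Stored tweets: " + rs.getLong(1));
      }
    }
  }
}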

Conclusion

This post demonstrates an easy configuration of a simple pipeline to store data into a LeanXcale database from a stream source, such as Twitter. This is the first step toward developing complex data analysis (like NLP), and in the next posts we will discuss how the LeanXcale dual interface helps us run analytical queries using SQL and some standard libraries.

In an advanced step, we will do this analysis with live data and build a complete example of real-time analytics.

If you have any doubts running this example or working with LeanXcale, then you can contact me using the information below.

Full code

src/main/java/Twitter2Lx.java

import com.leanxcale.kivi.database.Database;
import com.leanxcale.kivi.database.Field;
import com.leanxcale.kivi.database.Table;
import com.leanxcale.kivi.database.Type;
import com.leanxcale.kivi.session.Credentials;
import com.leanxcale.kivi.session.Session;
import com.leanxcale.kivi.session.SessionFactory;
import com.leanxcale.kivi.session.Settings;
import com.leanxcale.kivi.tuple.Tuple;
import java.sql.Timestamp;
import java.util.Arrays;
import twitter4j.FilterQuery;
import twitter4j.Status;
import twitter4j.StatusAdapter;
import twitter4j.TwitterStream;
import twitter4j.TwitterStreamFactory;
import twitter4j.conf.ConfigurationBuilder;
public class Twitter2Lx {
  private static final String SERVER_URL = "kivi:zk://localhost:2181";
  private static final String DATABASE_USER = "APP";
  private static final String DATABASE_NAME = "twitterdb";
  private static final String TABLE_NAME = "TWEETS";
  private static final String ID_FIELD = "id";
  private static final String USER_FIELD = "username";
  private static final String CREATED_FIELD = "created";
  private static final String TEXT_FIELD = "text";
  private static final String RT_FIELD = "rt";
  private static final String LAT_FIELD = "lat";
  private static final String LON_FIELD = "lon";
  private static final String PLACE_FIELD = "place";
  public static void main(String[] args){
    Settings settings = new Settings();
    settings.credentials(new Credentials().setDatabase(DATABASE_NAME).setUser(DATABASE_USER));
    try{
      Session session = SessionFactory.newSession(SERVER_URL, settings);
      Table table = getOrCreateTable(session);
      ConfigurationBuilder cb = new ConfigurationBuilder();
      cb.setDebugEnabled(true);
      TwitterStream twitterStream = new TwitterStreamFactory(cb.build()).getInstance();
      twitterStream.addListener(new LeanXcaleListener(table, session));

      twitterStream.sample();
      twitterStream.filter(new FilterQuery("datascience"));

      Runtime.getRuntime().addShutdownHook(
          new Thread(() -> {
            twitterStream.shutdown();
            session.close();
          }));
    }
    catch (Exception e){
      e.printStackTrace();
    }
  }
  private static Tuple tweetToTuple(Status status, Table table){
    Tuple tuple = table.createTuple();
    tuple.putLong(ID_FIELD, status.getId());
    tuple.putString(USER_FIELD, status.getUser().getName());
    tuple.putTimestamp(CREATED_FIELD, new Timestamp(status.getCreatedAt().getTime()));
    tuple.putString(TEXT_FIELD, status.getText());
    tuple.putInteger(RT_FIELD, status.getRetweetCount());
    if(status.getGeoLocation() != null){
      tuple.putDouble(LAT_FIELD, status.getGeoLocation().getLatitude());
      tuple.putDouble(LON_FIELD, status.getGeoLocation().getLongitude());
    }
    if(status.getPlace() != null){
      tuple.putString(PLACE_FIELD, status.getPlace().getFullName());
    }
    return tuple;
  }
  private static Table getOrCreateTable(Session session){
    Database database = session.database();
    if(!database.tableExists(TABLE_NAME)) {
      return database.createTable(TABLE_NAME, Arrays.asList(new Field(ID_FIELD, Type.LONG)),
          Arrays.asList(new Field(USER_FIELD, Type.STRING), new Field(CREATED_FIELD, Type.TIMESTAMP),
              new Field(TEXT_FIELD, Type.STRING), new Field(RT_FIELD, Type.INT),
              new Field(LAT_FIELD, Type.DOUBLE), new Field(LON_FIELD, Type.DOUBLE),
              new Field(PLACE_FIELD, Type.STRING)));
    }
    return database.getTable(TABLE_NAME);
  }
  private static class LeanXcaleListener extends StatusAdapter {
    int count;
    Table table;
    Session session;
    public LeanXcaleListener(Table table, Session session) {
      this.table = table;
      this.session = session;
      this.count = 0;
    }
    @Override
    public void onStatus(Status status) {
      try {
        if(count == 0){
          session.beginTransaction();
        }
        Tuple tuple = tweetToTuple(status, table);
        table.upsert(tuple);
        System.out.println("Inserting: " + tuple);
        count++;
        if(count > 5){
          count = 0;
          session.commit();
          System.out.println("Commit");
        }
      }
      catch (Exception e){
        e.printStackTrace();
        try {
          System.out.println("Rollback");
          session.rollback();
        }
        catch (Exception e1){
          e1.printStackTrace();
        }
      }
    }
  }
}
src/main/resources/twitter4j.properties
oauth.consumerKey = // Your consumer key
oauth.consumerSecret = // Your consumer secret
oauth.accessToken = // Your access token
oauth.accessTokenSecret = //Your access token secret
/pom.xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.leanxcale</groupId>
  <artifactId>twitterClient</artifactId>
  <version>0.1-SNAPSHOT</version>
  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <configuration>
          <source>8</source>
          <target>8</target>
        </configuration>
      </plugin>
    </plugins>
  </build>
  <properties>
    <kivi-api.version>0.96-SNAPSHOT</kivi-api.version>
  </properties>
  <dependencies>
    <dependency>
      <groupId>org.twitter4j</groupId>
      <artifactId>twitter4j-stream</artifactId>
      <version>4.0.6</version>
    </dependency>
    <dependency>
      <groupId>com.leanxcale</groupId>
      <artifactId>kivi-api</artifactId>
      <version>${kivi-api.version}</version>
    </dependency>
  </dependencies>
</project>
 

Written by

JLM.jpg

Javier López Moratalla, Software Engineer at LeanXcale

After more than 15 years developing, designing and deploying enterprise software at companies like Hewlett Packard Enterprise or Gemalto, I’m now part of the LeanXcale team, researching and working on the next generation of databases.

javier@leanxcale.com

https://www.linkedin.com/in/javier-lopez-moratalla/

 

Interview with Patrick Valduriez

Top database researcher and new scientific advisor at LeanXcale, Patrick Valduriez, talks to us about the latest trends in the database world, his new position at LeanXcale, and the upcoming edition of the best-selling book he co-authors with Professor Tamer Özsu, from the University of Waterloo in Canada.

Hi Patrick, and thank you very much for being here with us. It’s a pleasure to talk with you about the database world. But first, we want to know more about how you came to join LeanXcale.

Q. You have recently joined LeanXcale as Scientific Advisor. Why did you take that step?

Patrick Valduriez: It is a great opportunity for me, and probably the right time, to go deeper in applying the principles of distributed and parallel databases on real-world problems. LeanXcale has a disruptive technology that can make a big difference on the DBMS market. I am pleased to be part of this exciting adventure and have the chance to work with a great team of researchers and engineers in Europe.

 

Q. How did you meet LeanXcale for the first time?

Patrick Valduriez: I first met Professor Ricardo Jimenez-Peris (LeanXcale’s CEO and founder) in 2005 at a VLDB workshop on database replication, where we both gave talks. After some discussion, it became obvious that we could each learn much from the other. Ricardo is a leading expert in transaction management and database replication, which nicely complements my expertise in distributed and parallel query processing. Thus, we started doing joint research on distributed and parallel data management and have been good friends ever since. In 2013, Ricardo invited me to participate in the CoherentPaaS European Project, in which we developed the CloudMdsQL polystore. During the project, LeanXcale was created. Since then, our collaboration has continued, producing excellent research results.

 

Q. Could you give us an outlook of the database market nowadays?

Patrick Valduriez: For the last 30 years, the database market has been dominated by relational DBMSs, which have proved effective in mission-critical application domains (e.g., transaction processing and business intelligence). In particular, the SQL language has fostered their wide adoption, both from tool vendors and application developers. However, with the advent of big data, RDBMSs have been criticized for their “one size fits all” approach. As an alternative solution, more specialized NoSQL DBMSs, such as key-value stores, document stores and graph DBMSs, have emerged, able to scale out in large clusters of commodity servers. However, scalability has been typically achieved by relaxing database consistency. NewSQL is a recent class of DBMS that seeks to combine the scalability of NoSQL systems with the strong consistency and usability of RDBMSs. An important class of NewSQL is Hybrid Transaction and Analytics Processing (HTAP) whose objective is to perform real-time analysis on operational data, thus avoiding the traditional separation between operational database and data warehouse and the complexity of dealing with ETLs. LeanXcale is at the forefront of the HTAP movement, with a disruptive technology that provides ultra-scalable transactions, polyglot queries, key-value capabilities, and many others.

 

Q. You have been involved in the startup world before. How was that experience?

Patrick Valduriez: In the 1990s, I managed Dyade, a joint venture between Bull and Inria, to foster the development of core technologies in information systems. Dyade was a great success, with some major technology transfers into Bull products and four startups that are still in business (TrustedLogic, Kelkoo, Jalios and Scalagent). I was directly involved in the transfer of the Disco technology (Internet data integration system), which I developed with my Inria team, to KelKoo, a successful price comparator. I learnt a lot from this experience, in particular that, in addition to excellent research results, strong knowledge of the business domain and good vision of the future are critical.

 

Q. With Professor Tamer Özsu (University of Waterloo, Canada), you are co-author of “Principles of Distributed Database Systems”, the bestselling textbook on the topic. From the first edition published in 1991 to the upcoming fourth edition, how has the world of distributed database systems evolved?

Patrick Valduriez: Distributed database systems have moved from being a small part of the worldwide computing environment a few decades ago to mainstream today. The successive editions of the book reflect this impressive evolution. The first edition describes relational distributed database systems, involving just a few geo-distributed sites. The second edition introduces single-site distributed database systems, also called parallel DBMSs, and object-oriented distributed database systems. The third edition reflects the accelerated investigation of distributed data management technologies over the preceding period, in the context of P2P, cluster, XML, data streaming, web data integration and cloud computing systems. As a result, the book has become quite big (850 pages).

 

Q. What are the main updates on the new edition of the book?

Patrick Valduriez: First, to make room, we removed some background material, which is now well presented elsewhere, and reorganized and updated previous chapters. Second, we added new material on recent hot topics such as big data, NoSQL, NewSQL, polystores, web data integration and blockchain. As a short preview, note that there is a section on LeanXcale’s ultra-scalable transaction management approach in the transaction chapter and another section on LeanXcale’s architecture in the NoSQL/NewSQL chapter. My co-author and I thought these deserved to be in the book.

 

Q. As Scientific Advisor of LeanXcale, what is your role?

Patrick Valduriez: I see my role as a sort of consulting chief architect for the company, providing advice on architectural and design choices as well as implementation techniques. I will also do what I like most, i.e., teach the engineers the principles of distributed database systems, keep a technology watch, write white papers and blog posts on HTAP-related topics, and give presentations at various venues.

 

Q. What are you currently working on at LeanXcale?

Patrick Valduriez: The first topic is query optimization, based on the Apache Calcite open-source framework, where we need to improve the optimizer’s cost model and search space, in particular to support bushy trees in parallel query execution plans. The second topic is adding a JSON data type to SQL, inspired by the now famous SQL++ language, in order to combine the best of relational DBMSs and document NoSQL DBMSs.
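
To give a feel for what such a JSON data type could enable once it is available, here is a minimal sketch using the standard JDBC API through which LeanXcale is already accessed. The connection URL, the table with a JSON column and the JSON_VALUE path expression are illustrative assumptions rather than final LeanXcale syntax, since this feature is described above as work in progress.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class JsonInSqlSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection URL and credentials; adjust for your deployment.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:leanxcale://localhost:1529/testdb", "user", "password")) {

            // Illustrative table mixing relational columns with a JSON document column.
            try (PreparedStatement ddl = conn.prepareStatement(
                    "CREATE TABLE orders (id INT PRIMARY KEY, customer VARCHAR(64), details JSON)")) {
                ddl.execute();
            }

            // Insert a row whose 'details' column holds a JSON document.
            try (PreparedStatement ins = conn.prepareStatement(
                    "INSERT INTO orders VALUES (?, ?, ?)")) {
                ins.setInt(1, 1);
                ins.setString(2, "acme");
                ins.setString(3, "{\"items\": [{\"sku\": \"A1\", \"qty\": 3}], \"priority\": \"high\"}");
                ins.executeUpdate();
            }

            // SQL++-style access: reach inside the document while filtering on a relational column.
            // JSON_VALUE follows the SQL:2016 standard; its availability here is assumed for illustration.
            try (PreparedStatement qry = conn.prepareStatement(
                    "SELECT id, JSON_VALUE(details, '$.priority') AS priority FROM orders WHERE customer = ?")) {
                qry.setString(1, "acme");
                try (ResultSet rs = qry.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getInt("id") + " -> " + rs.getString("priority"));
                    }
                }
            }
        }
    }
}

The appeal of this approach is that semi-structured attributes would live alongside relational columns inside the same ACID transaction, instead of requiring a separate document store.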

 

Q. Is there anything else you would like to mention?

Patrick Valduriez: Well, the adventure has just started, and it is already a lot of fun. I like to learn from real problems, and LeanXcale has great use cases to satisfy my curiosity and creativity. I want to thank the company for its trust in me.


 
Patrick.jpg

Patrick Valduriez

Patrick Valduriez is a senior scientist at Inria in France. He was a scientist at Microelectronics and Computer Technology Corp. in Austin (Texas) in the 1980s and a professor at University Pierre et Marie Curie (UPMC) in Paris in the early 2000s. He has also consulted for major companies in the USA (HP Labs, Lucent Bell Labs, NERA, LECG, Microsoft), Europe (ESA, Eurocontrol, Ask, Shell) and France (Bull, Capgemini, Matra, Murex, Orsys, Schlumberger, Sodifrance, Teamlog). His successful career has been recognized with prestigious awards and prizes, such as the 1993 IBM scientific prize in France, the VLDB 2000 best paper award and the 2014 Innovation Award from Inria – French Academy of Science – Dassault Systems. Now, in a part-time consulting role, he embarks on a new adventure as Scientific Advisor at LeanXcale.

My visit to MIT

Originally posted by Ricardo Jimenez on LinkedIn

https://www.linkedin.com/pulse/my-visit-mit-ricardo-jimenez-peris/

20190318_154406 (1).jpg

Taking advantage of being in Boston for our stand at Enterprise Data World, I visited a couple of groups at the MIT Media Lab.

The visit was really interesting. I first met with Esteban Moro, who is doing very interesting work on extracting geo-temporal insights from the users of an app at very large scale. It turns out to be a very interesting use case for the LeanXcale database, making use of its dual interface, key-value and SQL. The problem requires ingesting data at very high rates and in very large volumes, which makes the key-value interface quite suitable, while at the same time running many queries based on keys, ranges and distance.

Later, I met with Luis Alonso, who is doing amazing work on the next evolution of smart cities. Until now, most of what I have seen around smart cities involves just small data and does not provide any appealing insight. However, his group is taking a holistic approach in which they have managed to involve different data providers in different cities to discover insights that can help improve the efficiency of cities. In the photo with him, you can see one of their smart-city models. They use Lego blocks to build a 3D model of the city, and then, with an overhead projector, they project the behaviour of people segmented according to different parameters, such as nationality, gender and age, so that one can see how behaviour differs across population segments during different events. It is really interesting work, and it also turns out to be a very interesting use case for LeanXcale, since they need to correlate massive amounts of data in real time.

Top database researcher Patrick Valduriez joins LeanXcale team

We are happy to announce that Patrick Valduriez has now joined LeanXcale as scientific advisor!

Patrick.jpg

Patrick is a senior scientist at Inria, a world-famous research organization in computer science and mathematics in France. He was also a senior researcher at Microelectronics and Computer Technology Corp. in Austin (Texas) in the 1980s and a professor at University Pierre et Marie Curie (UPMC) in Paris in the early 2000s. He has made extensive research contributions to distributed and parallel database systems and achieved successful technology transfers to companies.

He has been a consultant for major companies in the USA (HP Labs, Lucent Bell Labs, NERA, LECG, Microsoft), Europe (ESA, Eurocontrol, Ask, Shell) and France (Bull, Capgemini, Matra, Murex, Orsys, Schlumberger, Sodifrance, Teamlog). Now, LeanXcale joins this impressive list.

Patrick has received prestigious awards and prizes, including the best paper award at the VLDB conference in 2000, the IBM France scientific prize in 1993 and the Innovation Award from Inria-Académie des Sciences-Dassault Systems in 2014. He is an ACM Fellow.

Patrick is now involved in the launch of the fourth edition of the best-selling book “Principles of Distributed Database Systems”, co-authored with Professor Tamer Özsu from the University of Waterloo. Since its first edition back in 1991, the book has become the leading textbook on distributed data management. The new edition will feature new material on recent hot topics such as big data, NoSQL, HTAP, web data integration and blockchain.

In this new adventure as advisor, Patrick will share his knowledge with LeanXcale’s engineers to improve their HTAP SQL engine and evangelize the company’s disruptive technology.