Wednesday, August 29, 2018

2D Transform Pattern

Intro

Today I present one more of my pattern ideas. Nothing very modern this time, just a problem I ran into once while developing a UI. I hope someone will find this pattern interesting.

Motivation

This pattern formally describes a subclass of programming problems. It is very common for a program to transform data from one format to another. This pattern describes the transformation of a 1D array into a 2D tree structure. 
1D arrays often come from a configuration file or a database. It is much easier to represent data as a flat array in a configuration file or a database table: the access time is better and the data structure is simpler. Computer programs are very good at working with 1D arrays.
But that is the program's side. For the user, long 1D arrays are very hard to read, even sorted ones. Humans prefer structured data, like trees. 

Here is what we have:

Index:  0      1      2      ...  n
Value:  Object Object Object ...  Object

And here is what we'd like to show to the user:



The folders could themselves be objects or they could represent a collection of objects.

Applicability

Use this pattern when you need to transform a one-dimensional array of objects into a tree-like structure.

Structure

In order to transform the array into a tree structure, this structure must exist before the transformation; the purpose of the code is to fill it with the appropriate objects. The structure consists of 2 node types: 
  1. Folders. They are not objects; they contain other folders or objects as children. Within the scope of this pattern we consider only "static" folders: folders are not created during the transformation.
  2. Object-folders. These are objects that can contain other objects as children. They are selected from the initial array. 
We have 2 things in the beginning: the flat array of elements and a tree structure. Here is an example of the tree structure:
The most important part of the pattern is the way objects are selected. The "rules to select objects" are represented as template nodes with one or more selector nodes, like this:
public class Template {
    private List<Selector> selectors;
    ...
}
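
To make the template idea concrete, here is a minimal Java sketch of how a template with selectors could work. The Selector interface and its matching rule are my assumptions for illustration, not a definitive implementation:

import java.util.ArrayList;
import java.util.List;

// A selector picks objects from the flat array. The matching rule could be
// implemented in any format, e.g. a regular expression over object names.
interface Selector<T> {
    boolean matches(T obj);
}

class Template<T> {
    private final List<Selector<T>> selectors = new ArrayList<>();
    private final List<T> children = new ArrayList<>();
    private final boolean objectFolder; // an object-folder selects exactly one object

    Template(boolean objectFolder) {
        this.objectFolder = objectFolder;
    }

    void addSelector(Selector<T> selector) {
        selectors.add(selector);
    }

    // Selectors are processed in the order they were added; every object
    // they match in the flat array becomes a child of this template node.
    void fill(List<T> flatArray) {
        for (Selector<T> selector : selectors) {
            for (T obj : flatArray) {
                if (selector.matches(obj)) {
                    children.add(obj);
                    if (objectFolder) {
                        return; // only one object may be selected
                    }
                }
            }
        }
    }

    List<T> getChildren() {
        return children;
    }
}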

The selector selects some nodes from the array. Any format can be used to specify the rules for the selector. The selectors are processed in the order they are added, and the nodes they select from the array are added to the template node. If the node is an object-folder, then only one object may be selected. Here is what the tree structure should look like:

Consequences

This structure with two tiers (Template and Selector) has some important advantages:
  1. Simple rules. Complicated rules can be broken into several simple ones, with one selector for every simple rule. This is even more important if the rules are regular expressions. 
  2. Order control. The order in which the objects are selected can be controlled with properly chosen selectors. Adding selectors in the proper order can achieve the same result as a complicated sorting procedure. If special sorting is required that cannot be achieved with selectors, the template can use a sorting function.
  3. UI Design. The selector may specify how the selected objects should look to the user. For example the selector may add an icon to the objects or specify the background color. 

Final Thoughts

I do not claim to have invented something completely new. I tried to explain in clear terms how to solve this problem of transformation from a 1D array to a 2D tree structure.

Monday, August 27, 2018

Javascript Inheritance

Intro

Inheritance in Javascript is bad! Everybody who has ever used Javascript knows that. The concepts of "method inheritance" and prototypes can be hard to grasp for someone who is used to languages like Java. It was hard for me as well. But what if you really need to develop a small hierarchy in Javascript? Of course, there are some alternatives like GWT or Typescript with better support for classes and inheritance. But what if your boss just likes Javascript, or the customer doesn't want to hear about any modern replacements? And Javascript does have a lot of frameworks and libraries. Fortunately, there are approaches that make a small hierarchy possible.

Abstract Factory Approach

Here is what we need to accomplish:

We cannot declare these properties on the functions and then write
BaseObject.prototype = new VeryBaseObject();
ConcreteObject.prototype = new BaseObject();
because modifying any of the maps from BaseObject or VeryBaseObject will change the maps for all descendants.

Solution

We are going to use a factory to create objects. It won't be an abstract factory, just a factory. And we'll use the reference to the actual object (this) to set the properties on the object itself. We add some init functions that will create our properties:
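The init functions could look something like this (a sketch reconstructed from the factory code below; the property names are made up):

function VeryBaseObject() {}
VeryBaseObject.prototype.initvbo = function() {
    // the map is created on the instance (this), not on the prototype
    this.vboMap = {};
};

function BaseObject() {}
BaseObject.prototype = new VeryBaseObject();
BaseObject.prototype.initbo = function() {
    this.boMap = {};
};

function ConcreteObject1() {}
ConcreteObject1.prototype = new BaseObject();
ConcreteObject1.prototype.initCustom = function() {
    this.customMap = {};
};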

Then in the factory we'll call these methods from the ConcreteObject like this:
switch(objectType) {
    case OBJECT_TYPE_CONCRETE1:
        obj = Object.create(new ConcreteObject1()); 
        break;
    ...
}
obj.initvbo();
obj.initbo();
obj.initCustom();

When these functions are called on a ConcreteObject instance, this will be that instance, and the necessary properties will be installed on the instance without affecting the prototype. 
Here is the object structure that we'll get:
With the properties on the ConcreteObject instance we can get a Java-like inheritance on Javascript. 

In fact it could be a bit more complicated than that if different functions need to be called for different object types. Then the method that creates the object must be modified to call the appropriate functions. 

This simple solution may help if it is REALLY necessary to create and use a Java-like object hierarchy in Javascript. 

Spring Data with Thymeleaf

Intro

Today I'll talk about more specific issues. No design patterns or algorithms this time :-). We don't always design software components from scratch. Often we have to try to make existing software components work together. 
Spring Boot is one of the best pieces of free software in the Java world. It resolved a lot of Spring configuration issues. It is very flexible and offers great functionality. 
Spring Data is part of the Spring collection of projects. It offers advanced tools for working with databases. Among the most useful is the automatic repository: an interface can extend JpaRepository and most methods for working with the data will be generated automatically. 
Thymeleaf is an HTML template engine. It can use some of Spring Boot's features, like calling methods of Spring beans in the template, and a lot of other stuff. The official documentation has great tutorials. 
I used spring-boot-starter-parent versions 2.0.1.RELEASE - 2.0.4.RELEASE. Other dependencies were provided by Spring Boot.
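
As a quick illustration of such an automatic repository, here is a minimal sketch (Room is the entity from the example later in this post; the repository name is my own):

import org.springframework.data.jpa.repository.JpaRepository;

// Spring Data generates the implementation at runtime:
// findById(), findAll(), save(), delete() etc. come for free.
public interface RoomRepository extends JpaRepository<Room, Long> {
}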

Problem Description

The main idea of any application built with Spring Boot, Spring Data and Thymeleaf is to edit data in the database. spring-boot-starter-data-jpa includes Hibernate, which can be used to manipulate the data in the database. Thymeleaf can be used to show the data to the user. Spring Boot wires it all together. 
A very simple scenario includes one entity with a one-to-many relationship to another entity. The user wants to be able to create a new entity and select the other entity in an HTML select box. 
Here is where the first issue shows up: with the standard Thymeleaf structure the backing bean cannot be assembled. The object that was selected in the select box with the following construct:

<form action="#" th:action="@{/<some Action>}" th:object="${beanObj}" method="post">

    .... <other fields>

    <select th:field="*{room}" class="textinput">
        <option th:each="currRoom : ${allRooms}"
                th:value="${currRoom}" th:text="${currRoom.name}">no name</option>
    </select>
</form>

is not created by Thymeleaf. I didn't find any mention of this in the official documentation.

Solution

After some debugging I found the root cause. It turned out Thymeleaf passes all the fields as parameters of the POST request: it uses the toString method to transform the object to a String and adds it as a parameter. It sends a parameter like this:
room: Room+[id=273,+name=room111]

In the controller method this value must be transformed back to the object form. Spring Boot uses converters to do this. 
The solution is to register the appropriate converters with the conversionService and to use these converters in the toString methods of the entities, so that the same logic is used to convert to the String form and back. 
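
Here is a minimal sketch of such a converter, assuming a Room entity whose toString produces the format shown above and the hypothetical RoomRepository from the intro (an illustration, not the exact project code):

import org.springframework.core.convert.converter.Converter;

public class StringToRoomConverter implements Converter<String, Room> {

    private final RoomRepository repository;

    public StringToRoomConverter(RoomRepository repository) {
        this.repository = repository;
    }

    @Override
    public Room convert(String source) {
        // parse the id out of a string like "Room [id=273, name=room111]"
        int start = source.indexOf("id=") + 3;
        int end = source.indexOf(',', start);
        Long id = Long.valueOf(source.substring(start, end).trim());
        return repository.findById(id).orElse(null);
    }
}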

Next Problems

Sounds funny, doesn't it? The solution has been found, but there are more problems? Actually, the described solution works well without Spring Data. With Spring Data the conversion fails again. And Spring Boot wants you to create the entityManagerFactory bean, even though this bean was not needed without Spring Data. 

Next Solutions

The problem with the entityManagerFactory bean can be resolved after some intensive searching on the Internet. Here is the solution I ended up with:

    @Primary
    @Bean
    public LocalContainerEntityManagerFactoryBean entityManagerFactory(DataSource ds) {
        LocalContainerEntityManagerFactoryBean em = new LocalContainerEntityManagerFactoryBean();
        em.setDataSource(ds);
        em.setPackagesToScan("<some packages>");

        JpaVendorAdapter vendorAdapter = new HibernateJpaVendorAdapter();
        em.setJpaVendorAdapter(vendorAdapter);
        em.setJpaProperties(additionalProperties());
        return em;
    }

    @Bean
    public SessionFactory sessionFactory(@Qualifier("entityManagerFactory") EntityManagerFactory emf) {
        return emf.unwrap(SessionFactory.class);
    }

    private Properties additionalProperties() {
        Properties properties = new Properties();
        properties.setProperty("hibernate.dialect", "org.hibernate.dialect.PostgreSQLDialect");
        properties.setProperty("hibernate.default_schema", "public");
        properties.setProperty("hibernate.show_sql", "true");
        // Schema validation would fail because the tables use bigint (a 64-bit type)
        // as the ID while the entities map it to Integer, so validation is disabled.
        properties.setProperty("hibernate.hbm2ddl.auto", "none");
        return properties;
    }

The second problem turned out to be more complicated and required a lot of debugging. Eventually I found out that Spring Data somehow changes the conversion service Spring Boot is using: instead of the default conversionService, the mvcConversionService is used. The formatters/converters must therefore be registered in the class that implements WebMvcConfigurer, in its addFormatters method:

    @Override
    public void addFormatters(FormatterRegistry registry) {
        registry.addConverter(new <SomeConverter>());
        ...
    }

Now with all problems resolved Spring Data can work with Thymeleaf. 
Happy coding and diligent debugging! 

Here is a link to Javacodegeeks Spring Data Tutorials

Thursday, August 23, 2018

Draft Tree

Intro

I like blogging about different design patterns, algorithms and approaches to solving different problems. This time I'll describe a problem I ran into some time ago and the approach I found to solve it. As always, I do not claim to have found a unique and absolutely perfect solution by myself. But I like the idea of keeping and reusing successful and interesting approaches to solving problems: even if the code cannot be reused, the idea itself is often very useful. This is not exactly a design pattern in the classical sense, it is more like a "design approach", but I'll describe it as a design pattern.

Motivation

If one wants to work with a DSL these days, he or she usually has to use Antlr. This tool is free (BSD license) and very useful. The grammars are simple enough, and the code can be generated for Java, Javascript and other languages. In fact, it is the tool of choice for anyone who wants to take on some DSL-related work.
When the grammar is complete and working well what the user has is the grammar tree. It looks like this:

This example comes from a simple grammar that parses logical expressions. The figure shows the grammar tree for 1 > 0 OR x < Y AND 10 > Z2. 
Antlr offers great tools to work with these trees: Listeners and Visitors. Listeners fire when a node is encountered and visitors visit nodes one by one. These tools are really great and help a lot. 

But what if the transformation one is trying to accomplish cannot be achieved with Antlr's listeners and visitors? This can be the case if the DSL structure is significantly different from the structure one is trying to achieve (we'll call it the 'code structure' from now on; the implication is that the DSL is transformed into source code). 
Some people may try to modify the Antlr tree a little, some may add more complicated listeners/visitors. In this blog post I'll argue that there is a better solution.

Applicability

Use this pattern when you need to perform complicated transformations on the Antlr tree that are hard to accomplish directly with a listener or visitor.

Structure

There is a simple and elegant solution to this problem: introduce an intermediate tree structure. In Java this can be accomplished with the class DefaultMutableTreeNode and other classes from javax.swing.tree. The API may look a bit old-fashioned, with Enumeration and no generics, but it works. The first tree instance is a copy of the Antlr tree with proper user objects in the nodes. Then we modify this tree step by step to bring it closer to the code structure. With this approach there is no need to do it all in one step. Here is an example of the transformation:

This intermediate tree is like a draft that needs to be improved. After some steps it is ready to be transformed into the final code structure. 
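
Just to make the first step concrete, here is a minimal sketch of copying an Antlr 4 parse tree into swing tree nodes (the user object here is only the node text; a real transformation would carry more data):

import javax.swing.tree.DefaultMutableTreeNode;
import org.antlr.v4.runtime.tree.ParseTree;

public class DraftTreeBuilder {

    // Recursively copy the Antlr tree into a draft tree of DefaultMutableTreeNodes.
    public static DefaultMutableTreeNode copy(ParseTree antlrNode) {
        DefaultMutableTreeNode draftNode = new DefaultMutableTreeNode(antlrNode.getText());
        for (int i = 0; i < antlrNode.getChildCount(); i++) {
            draftNode.add(copy(antlrNode.getChild(i)));
        }
        return draftNode;
    }
}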

Consequences

  1. Simplicity. The process of transforming the DSL into the code structure can be defined as simple, concise operations on the draft tree. This makes the transformation code less complex. The implementation may even use a modified version of the Assembly Line pattern.
  2. Loose coupling. All the operations on the draft tree are independent of the original DSL and the final code structure. If either the DSL or the code structure has to be modified, the draft tree can be adapted to the new form by adding or removing steps. There is no tight coupling between the DSL and the code structure.

Final Thoughts

No complete sample code is provided in this case because any real code would be more specific than the design approach itself.

Thursday, August 16, 2018

Tree Creation - Mosaic Algorithm

Motivation

When I hear the word 'mosaic' I imagine an old, possibly ancient Roman, house with mosaic floors depicting ancient heroes. In ancient times slaves spent many years putting the pieces in the correct order. 
The idea of a mosaic algorithm follows a similar pattern. Imagine that you have a lot of binary relations like "node A is the parent of node B", "node C is the child of node D", "nodes B and D are children of the same node". You can get such a set of binary relations from a natural language processing tool. For example the great Stanford NLP library can give you a set of relations like 
  • Richard    per:children    Sonya
  • Sonya       per:siblings   Samantha
  • Samantha per:children   Robert
It would be great to build a complete tree from this set of relations. This is what the mosaic algorithm is about.

Applicability

Use this algorithm when you need to create a tree from a set of relations between the nodes. We assume that all the relations are correct even though the algorithm can handle duplicate relations.

Structure

The main idea is to use the set of relations as mosaic pieces. For every relation a fragment of the tree is created or modified. We'll consider 2 cases, a children relation and a siblings relation, like this:

Case      Subject  Relation  Object
Children  NodeX    children  NodeY
Siblings  NodeA    siblings  NodeB

This converts to the following structures in the code:

For any of these relations any node may already exist among the created fragments. It is also possible that no nodes exist for this relation. The necessary actions are summarized below:

Children relation:
  • Neither Subj nor Obj exists: create 2 nodes, one parent and one child.
  • Subj exists, Obj doesn't: find the parent among the existing nodes, add a child node.
  • Subj doesn't exist, Obj exists: create the parent node, find the child among the existing nodes and add it as a child to the parent node.
  • Both exist: establish the relation if the nodes belong to different fragments (have different roots).

Siblings relation:
  • Neither sibling exists: create 2 nodes with a dummy parent.
  • One sibling exists (either Subj or Obj): find the existing sibling, add a dummy parent if necessary, add the new node as a child to the parent.
  • Both exist: establish the relation if the nodes belong to different fragments (have different roots).

As every relation is processed either a new fragment is created or an existing fragment is modified. When relation processing is complete we have the full picture.
Only one important part of the algorithm remains: how to find the existing nodes for a relation. To achieve this, after every relation is processed the newly created nodes are added to a Map or a List (if the nodes have unique keys, a HashMap is faster than an ArrayList).
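Here is a minimal sketch of the children case, assuming every node has a unique name that serves as the key (the class and method names are mine):

import java.util.HashMap;
import java.util.Map;
import javax.swing.tree.DefaultMutableTreeNode;

public class MosaicBuilder {

    // all nodes created so far, looked up by their unique name
    private final Map<String, DefaultMutableTreeNode> nodes = new HashMap<>();

    // Process one "subject children object" relation.
    public void addChildRelation(String subject, String object) {
        DefaultMutableTreeNode parent = nodes.computeIfAbsent(subject, DefaultMutableTreeNode::new);
        DefaultMutableTreeNode child = nodes.computeIfAbsent(object, DefaultMutableTreeNode::new);
        // establish the relation only if the nodes belong to different fragments
        if (parent.getRoot() != child.getRoot()) {
            parent.add(child);
        }
    }
}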
An illustration of how the tree evolves as the relations are processed:

Final Thoughts

I think I should be humble and avoid claiming that a new algorithm has been invented. This algorithm could be a special case of a well-known more general algorithm. But the idea looked interesting and I decided to write a post about it.

Tuesday, August 14, 2018

Multilevel Adapter Pattern

Intro

This post continues my "new pattern ideas" series. I'll describe a special version of the adapter pattern that can guarantee several levels of access to objects. In particular I'll show 3 main levels - read, write and class access.

Motivation

We'll consider a 2-tiered system: 
  1. The top tier consists of object folders. Every folder contains objects of one type.
  2. The bottom tier consists of the objects themselves.
What we need to do is separate the access levels to this system. We can have read access, write access and class access at any tier. Read access means reading the properties of the folder list, a folder or an object. Write access means everything from the read level, plus the ability to add/delete folders in the list, to add/delete objects in a folder, and write access to the objects' properties. Class access implies that the classes for the folder and the object have some additional lower-level details that can be exposed.

Applicability

Use this pattern when you need to use different access levels to a hierarchy of objects. The hierarchy of access levels is clearly defined with the next level having all the privileges of the previous level plus some privileges of its own.

Structure

In general there could be more complicated models but in this specialized version we'll use the following access levels: read, write, class. We also assume that every next level has all the methods of the previous level plus some method specific to this level. All the objects in the structure need to have the following general type hierarchy:
If read access is required, a ReadInterface is returned; if write access is required, a WriteInterface is returned. The classes themselves are not public. Class access means there can be specific operations within the class that are used only in the class or in the package.
But this is not all. We have 2 tiers: folders and objects. It is very important to make sure that if an outside entity has read access to a folder, it gets read access to the objects in this folder. The methods of the read interface, like getObject(<parameters>), must return the read interface of the object:
public IReadObject getObject(<parameters>) {
...
}
In accordance with this principle the methods of the write interface must return the write interface of the object:
public IWriteObject getObjectFull(<parameters>) {
...
}
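
Putting the two tiers together, a minimal sketch of the interfaces could look like this (all names are illustrative; in a real project each interface would live in its own file):

// object tier
interface IReadObject {
    String getName();
}

interface IWriteObject extends IReadObject {
    void setName(String name);
}

// folder tier: read access to a folder exposes only read access to its objects
interface IReadFolder {
    IReadObject getObject(String name);
}

// write access to a folder implies write access to the objects inside it
interface IWriteFolder extends IReadFolder {
    IWriteObject getObjectFull(String name);
    void addObject(IWriteObject obj);
}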

This is the structure for folders:
More tiers can be added following the same principle: read access to a tier means read access to all the lower tiers, write access to a tier means write access to the lower tiers, etc. The same level of access propagates down the hierarchy. 
It is as if, for every access level, an adapter were created for the implementation class. 

Participants

Folders - contain objects
Objects - elements that can be accessed
ReadInterface - an interface at any tier that represents read access to the object at this tier
WriteInterface - an interface at any tier that represents write access to the object at this tier
Implementation Class - the class of the element at this tier

Consequences

  1. Consistent access level to the folders and objects. In general it could be more fine-grained like read access here and write access there. But in most cases read access to some object in the hierarchy means read access to the objects below.
  2. Clearly defined interfaces for all access levels, good encapsulation. 

Monday, August 13, 2018

Search for time interval in logs

Intro

This post is indirectly related to my mini-series about log analysis. It would be great to read the two main parts to better understand what I'm talking about. Part 1, Part 2.
This post describes one important problem I ran into while implementing the IDE approach. 

Task Description

When someone is working with logs, usually he or she needs to investigate only one time interval. The available logs may span days, but the time interval that must be investigated is 1-2 hours. The task is to select all log records within that interval. 

Basic Log Record Regex

In order to select a log record we need a regular expression that matches any log record. For the simple log4j format like 
2018-08-10 11:00:56,234 DEBUG [Thread-1] package1.pkg2.Class1 Text Message

I found the following regex:
TIME_REGEX((?!(TIME_REGEX)).*\r?\n)*
This regular expression matches both single-line and multiline log records. The time regex could be 
\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d,\d\d\d
So if somebody wanted to load all logs into a text window he could open the log files one by one and use Matcher.find() with this regex to get all log records. 
This regex relies on the fact that the time regex pattern is never repeated in the body of the log message, which is true in 99% of cases. 

Datetime of the Log Record

In order to search for a specific time interval and to use other features, it makes sense to extract the datetime information from the log record. Thankfully this task has been solved by the JDK with DateTimeFormatter. It is enough to specify the format for the log type and the date can be extracted. For example, for the log record above the format is
yyyy-MM-dd HH:mm:ss,SSS
As soon as we can extract the datetime information, we can specify the interval as datetime values, not as Strings in some specific format.
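A minimal sketch of the extraction and the comparison (the record and the interval bounds are examples):

import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

DateTimeFormatter fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");
LocalDateTime ts = LocalDateTime.parse("2018-08-10 11:00:56,234", fmt);

LocalDateTime from = LocalDateTime.of(2018, 8, 10, 10, 0);
LocalDateTime to = LocalDateTime.of(2018, 8, 10, 12, 0);
// true if the record's timestamp is within the interval (inclusive)
boolean inInterval = !ts.isBefore(from) && !ts.isAfter(to);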

Search Time

Now that we have found a way to select any log record and extract the date information from it the path forward seems clear: 
  • specify the interval, 
  • select the records one by one
  • extract the date information from the log record
  • compare the datetime with the interval
  • if the datetime is within the interval add this record to the List of found records
  • after searching through all files show the found records 
There is one big issue with this approach: time. With 50 log files of 50 MB each, it will take hours to scan them all to find 10 MB of records in the interval. 

Solution

We can use one trick to filter out the files that do not contain a single record in the interval. We use the fact that the records in a log file are written one after the other, which means the time of the next record is equal to or later than the time of the current record. For example, only 2 situations are possible:
2018-08-10 11:00:56,234 DEBUG [Thread-1] package1.pkg2.Class1 Text Message
2018-08-10 11:00:56,234 DEBUG [Thread-1] package1.pkg2.Class1 Msg 2
Or
2018-08-10 11:00:56,234 DEBUG [Thread-1] package1.pkg2.Class1 Text Message
2018-08-10 11:00:56,278 DEBUG [Thread-1] package1.pkg2.Class1 Msg 2
I have rarely seen cases where, under high load, log records appear in reverse order, but the difference is in milliseconds. We can consider this difference insignificant for our purpose. 

This means that if neither the first nor the last record in the file is in the interval, then no record in the file is in the interval and the file can be filtered out. Java regular expressions have special constructs to find the first and the last records.
The first record:
\ATIME_REGEX((?!(TIME_REGEX)).*\r?\n)*
The last record:
TIME_REGEX((?!(TIME_REGEX)).*\r?\n)*\Z
\A means the beginning of the text, \Z means the end of the text. You can find more details in the javadocs for java.util.regex.Pattern. 

The solution is to use a special prescanning technique. Before scanning the whole text of a log file, find the first and the last records, and if neither of them is in the interval, skip the file. Of the 50 files maybe 1-2 need to be scanned. 
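A sketch of the prescan could look like this (the inInterval predicate is assumed to parse the record's timestamp as shown earlier and compare it with the interval bounds):

import java.util.function.Predicate;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Prescan {

    private static final String TIME =
            "\\d\\d\\d\\d-\\d\\d-\\d\\d \\d\\d:\\d\\d:\\d\\d,\\d\\d\\d";
    private static final String RECORD = TIME + "((?!(" + TIME + ")).*\\r?\\n)*";
    private static final Pattern FIRST = Pattern.compile("\\A" + RECORD);
    private static final Pattern LAST = Pattern.compile(RECORD + "\\Z");

    // Scan the file only if its first or its last record falls into the interval.
    public static boolean mustScan(String fileText, Predicate<String> inInterval) {
        Matcher first = FIRST.matcher(fileText);
        Matcher last = LAST.matcher(fileText);
        return (first.find() && inInterval.test(first.group()))
                || (last.find() && inInterval.test(last.group()));
    }
}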

Conclusion

REAL uses this technique to speed up searches for a datetime interval. I found that it takes approximately 5-10 seconds to decide if a file must be skipped. Most of this time is spent executing Matcher.find() for the last record; the first record is found much faster. I think it is possible to speed this up even further by searching only the last 5 MB of a 50 MB file for the last record. But even in the current state it is fast enough. 

Thursday, August 9, 2018

Crosswalk Pattern

Intro

CompletableFuture is great. It is so great that it definitely deserves more attention. It simplified parallel execution immensely and, what is even more important, no third-party libraries are necessary. But working with it requires some finesse. Today I'll show how to use CompletableFuture with lockable resources. This post continues the series of pattern ideas.

Java's ReentrantReadWriteLock

This very useful class has been around since Java 5. If you have a resource that can be read or written to, you can wrap it with a ReentrantReadWriteLock. When a client wants to write, it needs to acquire the write lock; when a client wants to read, it needs to acquire a read lock. Yes, exactly: the write lock and a read lock. Many read locks can be held at the same time (as many as 65535), but only one write lock can be held at any time. No read lock can be acquired by other threads while the write lock is held, and the write lock cannot be acquired while at least one read lock is held. These are the rules of the game. More information can be found in the javadocs of ReentrantReadWriteLock.
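
As a reminder, the usual usage pattern looks like this (nothing here is specific to this post yet):

import java.util.concurrent.locks.ReentrantReadWriteLock;

ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

lock.readLock().lock();   // many readers may hold this at the same time
try {
    // read the resource
} finally {
    lock.readLock().unlock();
}

lock.writeLock().lock();  // exclusive: no readers, no other writers
try {
    // modify the resource
} finally {
    lock.writeLock().unlock();
}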

Motivation

So what is the problem with CompletableFuture? The problem is that in Java a lock is held by a thread: one cannot perform lock.lock() in one thread and lock.unlock() in another. It won't work. This issue becomes particularly important if certain tasks must be executed in parallel and some resource must remain unchanged while the tasks are being executed. These tasks need to take either a read lock or the write lock of the resource and unlock it after all is done.

Applicability

Use this pattern if you need to ensure that some object or objects do not change during some long-running (or not so long-running) tasks. A necessary requirement is that these objects are protected by a Java Lock, possibly a ReentrantReadWriteLock.

Structure

The solution involves using an additional single-threaded executor. It is assumed we can get a Runnable that locks all the locks and a Runnable that unlocks them. These Runnables must be executed in the single-threaded executor (STE from now on). Here are the steps that must be run in the STE, in text form:
  1. Run the locking Runnable with CompletableFuture.runAsync(runnable, STE)
  2. Check that all locks have been locked (with a LockWrapper for example)
  3. Get the aggregate CompletableFuture for the tasks which are run in a different executor. You can do it in a runnable but it must be done in the STE. The tasks cannot start before the locks have been locked.
  4. Call get() or get(timeout, timeUnit) on the aggregate CompletableFuture.
  5. Run the unlocking Runnable
In this case it makes more sense to show the structure of execution, not the structure of classes. Here it is:
On the diagram this single-threaded executor looks like a crosswalk, doesn't it?

Participants

Single-threaded executor - the executor that obtains the locks and releases them after all tasks have completed execution
Tasks Executor - the executor that executes the tasks in parallel
Locks - the locks that must be locked to ensure some object or objects do not change while the tasks are being executed

Consequences

  1. Can run long-running tasks in a separate executor and ensure some object or objects remain unchanged during the execution
  2. Can use Java's locks and ReentrantReadWriteLock. 

Sample code

public CompletableFuture<A> execute() {
    CompletableFuture<Void> lockFT = CompletableFuture.completedFuture(null);
    if (lockRunnable != null) {
        // the result must be reassigned, otherwise the locking stage would be lost
        lockFT = lockFT.thenRunAsync(lockRunnable, ste);
    }
    CompletableFuture<Void> execFT = lockFT.thenRunAsync(() ->
        {
            if (<all locks locked>) {
                CompletableFuture<A> tasksAggr = <get future>;
                try {
                    // wait until all tasks have completed execution
                    tasksAggr.get();
                } catch (InterruptedException | ExecutionException e) {
                    <report this>
                }
            } else {
                <report this>
            }
        },
        ste);
    CompletableFuture<Void> unlockFT;
    if (unlockRunnable != null) {
        // execute unlock even if the previous stage completed exceptionally
        unlockFT = execFT.whenCompleteAsync((Void x, Throwable t) -> unlockRunnable.run(), ste);
    } else {
        unlockFT = execFT;
    }
    return unlockFT.thenApplyAsync(<return value>, ste);
}

Final Thoughts

In this simple pattern it is not the structure that is important but the threads that execute it. By correctly handling a single-threaded executor and combining it with another possibly multi-threaded one we can use Java's ReentrantReadWriteLock with multiple threads. This single-threaded executor serves as a crosswalk that the operations must use to cross the road. If some operations veer off this crosswalk this scheme will fail.

Tuesday, August 7, 2018

IDE approach to log analysis pt. 2

Intro

In the first part I explained the theoretical approach to log analysis that I think is best for a sustain engineer. This engineer doesn't need to analyze logs immediately as they come but instead is focused on a deep analysis of complicated issues. In this second part I'll show that many search scenarios can be covered with one sophisticated template and show a working prototype.

Search Object Template

The main requirement for the search template is that it must be sophisticated, very sophisticated in the best case. The less manual searching, the better. A sophisticated template should do most of the work and do it fast. As we don't have any servers here, only the developer's PC, which is expected to handle 2-3 GB of logs, speed is also important. 

Main Regular Expressions

The template should declare some regular expressions which will be searched for (with Matcher.find) in the logs. If more than one is declared, first the results for the first one are collected, then for the second, etc. In the most general sense the result of a search is a list of Strings - List<String>. 

Acceptance Criteria

Not all results are accepted by the searching process. For example, the engineer can search for all connection types excluding "X". Then he or she can create an acceptance criterion that filters them out by specifying a regex like "any type but X". Another possibility is searching within a time interval. The engineer can search for any log record between 10 and 12 hours (he or she has to enter the complete dates, of course). 
Looking for distinct expressions is also possible. In this case the engineer specifies one more regular expression (more than one in the general case). An example will explain this concept better. 
distinct regex: connection type (q|w)

log records found by the main regex:
connection type w found
connection type q created
connection type s destroyed
connection type q found

The result of a distinct search:
connection type w found
connection type q created

Parameters

One of the issues with regular expressions is that really useful regular expressions are very long and unwieldy. Here is a sample date from a log:
2018-08-06 10:32:12.234
And here is the regex for it:
\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d.\d\d\d
The solution is quite simple: use substitution. I call these substitutions parameters of the regex. Some parameters may be static, like the time of the record, but some may be defined by the user. Immediately before execution the parameters are replaced with the actual values.
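A minimal sketch of the substitution, with a parameter syntax I made up for illustration:

import java.util.HashMap;
import java.util.Map;

Map<String, String> params = new HashMap<>();
// a static parameter
params.put("{TIME}", "\\d\\d\\d\\d-\\d\\d-\\d\\d \\d\\d:\\d\\d:\\d\\d.\\d\\d\\d");
// a user-defined parameter
params.put("{CONN}", "1234");

String regex = "{TIME} .* Connection {CONN} moved from state .*";
for (Map.Entry<String, String> e : params.entrySet()) {
    // String.replace is literal, so the backslashes in the value are safe
    regex = regex.replace(e.getKey(), e.getValue());
}
// regex is now a concrete regular expression ready to be compiled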

Views

The result of the search is a log record i.e. something like
2018-08-06 10:32:12.234 [Thread-1] DEBUG - Connection 1234 moved from state Q to state W
While it is great to find what was defined in the template it would be even better to divide the information into useful pieces. For example this table represents all the useful information from this record in a clear and concise way:

Connection 1234 Q -> W

To extract these pieces of information we can use the "view" approach. This means declaring smaller regexes that are searched for within the log record, each returning one piece of information about it. It is like a view of this log record. Showing it all in a table makes it easier to read; also, a table can be sorted by any column.
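
A minimal sketch of one view for the record above (the regex and its groups are invented for this record format):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

Pattern view = Pattern.compile("Connection (\\d+) moved from state (\\w+) to state (\\w+)");
String record = "2018-08-06 10:32:12.234 [Thread-1] DEBUG - "
        + "Connection 1234 moved from state Q to state W";

Matcher m = view.matcher(record);
if (m.find()) {
    // one table row: connection id, old state, new state
    System.out.println(m.group(1) + " | " + m.group(2) + " -> " + m.group(3));
}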

Sort and Merge

The most efficient way to run this kind of search with the template is to use a thread pool and assign a log file to every thread. Assuming there are 3-4 threads in the pool, the search will work 3-4 times faster. But merging the results becomes an important issue. There can be 2 solutions here:
  1. Merging results. We need to make sure that the results go in the correct order. If we have 3 log files, the first one covering 10-12 hours, the second 12-14, the third 14-17, then the search results from those files must go in the same order. This is called merging.
  2. Sorting results. Instead of merging them we can just sort them by date and time. Less sophisticated but simple.
Merging looks like a more advanced technique which allows us to keep the original order of records.

Workflow


Final Thoughts

The question that must be nagging everyone who has reached this point in this post is: Has anyone tried to implement all this? The answer is yes! There is a working application that is based on the Eclipse framework, includes a Spring XML config and a lot of other stuff. The search object templates work as described in this article.
Here is the Github link:
https://github.com/xaltotungreat/regex-analyzer-0
Why 0? Well, it was meant to be a prototype and to some extent it still is. I called this application REAL:
Regular
Expressions
Analyzer
for Logs
It is assumed the user has some knowledge of how to export an Eclipse RCP application or launch it from within the Eclipse IDE. Unfortunately I didn't have enough time to write any good documentation about it. By default it can analyze HBase logs and there are a lot of examples in the config folder.