HowTo: Run Cloudera Quickstart in Docker

The best way to familiarize yourself with the Hadoop ecosystem or to do a POC is to play with it in a sandbox. For that, Cloudera provides two QuickStart options: a virtual machine image and a Docker image. Both contain a small CDH cluster with a single DataNode, but that is more than enough for sandbox purposes.

However, most data engineering neophytes find it difficult to set up, since the official Cloudera installation guide unfortunately misses some essential parts about post-installation tweaks. Here you will find a quick and easy way to set up the Cloudera QuickStart Docker image on Linux.

If you have already installed the Docker image and don’t want to repeat those steps, just move on to the part with the solution or to the fixed Docker image.

Run QuickStart according to official documentation

Check that you have Docker installed: in a terminal, type docker -v to see the version. (This HowTo doesn’t cover Docker installation; see the manual on the official Docker site.)

Check the status of the Docker service on your machine and start it if needed:

sudo systemctl status docker
sudo systemctl start docker

In my case Docker was stopped, so I had to start it. Now that Docker is installed and ready to work, we need to download the QuickStart image: sudo docker pull cloudera/quickstart:latest.  Depending on your connection, the download can take a while – be patient :).

After a successful download, our Cloudera QuickStart image should appear in the list of Docker images: docker images.  Copy the <IMAGE ID> of ‘cloudera/quickstart’; we will use it later. In my case it is ‘4239cd2958c6’. Finally, the main command to run the Docker image as a container is:

sudo docker run -d --hostname=quickstart.cloudera --privileged=true -t -i -p 8888 -p 7180 4239cd2958c6 /usr/bin/docker-quickstart

/usr/bin/docker-quickstart #Entry point to start all CDH services. Provided by cloudera
--hostname=quickstart.cloudera #Required: pseudo-distributed configuration assumes this hostname
--privileged=true #Required: for HBase, MySQL-backed Hive metastore, Hue, Oozie, Sentry, and Cloudera Manager, and possibly others
-t  #Required: once services are started, a Bash shell takes over and will die without this
-i  #Required: if you want to use the terminal, either immediately or attach later
-p 8888  #Recommended: maps the Hue port in the guest to another port on the host
-p 7180  #Recommended: maps the Cloudera Manager port in the guest (7180) to another port on the host
-p [PORT] #Any other ports you want to remap from guest to free host ports, to make them accessible outside the container
-d   #Optional: runs the container in the background. I would recommend using this option if you plan to keep the container running in the background.

Fix the standard image to make it work

Next comes the tuning that makes your sandbox operational. By default, Cloudera Manager is not started in the container, so let’s enable it first.

Obtain the <CONTAINER_ID> value with docker ps. In my case it is ‘5fadd6cb8e0c’.

Connect to the QuickStart container shell: docker attach <CONTAINER_ID>, then run the script to enable Cloudera Manager: /home/cloudera/cloudera-manager --express

At the end the script prints out the address on which you can access Cloudera Manager. Surprisingly, you won’t be able to connect to it. That happens because we remapped the port earlier, so now we need to find out which host port our guest port 7180 was mapped to.

Detect the new guest-to-host port mapping: sudo docker port <CONTAINER_ID> <guest_port>. Thus, to connect to Cloudera Manager I need to open ‘0.0.0.0:32771’, and finally we will see this:

But wait a second, something is red on our Hosts page. Let’s take a closer look at the error and fix it. It’s a clock offset error:

Basically it means that our ntpd service is either not started or can’t reach any NTP servers. The solution is:

date      # shows the difference between the real date and the server one
sudo chkconfig --add ntpd
sudo service ntpd restart
date      # make sure ntpd is working and the date is in sync

Wait a couple of minutes and check Cloudera Manager again. As you can see, the error is gone.

From now on, as long as the container is not turned off, the ntpd service will keep working properly. If you don’t want to run this extra command in the Cloudera Docker container every time it is restarted, you can simply use my Docker image, based on the Cloudera QuickStart one, or create one yourself.
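
If you want to build such an image yourself, a minimal sketch could look like this (my-quickstart-ntpd is an arbitrary name, and I haven’t verified this against every QuickStart/CDH version – treat it as a starting point, not a definitive recipe):

FROM cloudera/quickstart:latest

# register ntpd so it can be started on every boot of the container
RUN chkconfig --add ntpd

# start ntpd first, then hand over to the standard QuickStart entry point
CMD ["/bin/bash", "-c", "service ntpd start && exec /usr/bin/docker-quickstart"]

Build it with sudo docker build -t my-quickstart-ntpd . and run it with the same flags as before, just omitting the trailing /usr/bin/docker-quickstart so the CMD above takes effect.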

Scala as backend language. Tips, tricks and pain

I was handed a legacy service written in Scala. The stack was: Play2, Scala, Slick, Postgres.

This post describes why such a technology stack is not the best option, what can be done to make it work better with less effort, and how to avoid the hidden pitfalls.

For the impatient:
If you have a choice – don’t use Slick.
If you have more freedom – don’t use Play.
And finally – try to avoid Scala on the back-end. It might be good for Spark applications, but not for backends.

Data layer

Every backend with persistent data needs a data layer.

From my experience the best way to organize this code is the repository pattern: you have your entity (DAO) and a repository, which you access whenever you need to manipulate the data. Modern ORMs are your friends here – they do a lot of things for you.

Slick – back in 2010

That was my first thought when I started using it. In Java you can use Spring Data, which generates a repository implementation for you. All you need is to annotate your entity with JPA and write the repository interface.

Slick is another thing. It can work in two ways.

Manual definition

You define your entity as a case class, mentioning all needed fields and their types:

case class User(
    id: Option[Long],
    firstName: String,
    lastName: String
)

And then you manually repeat all the fields and their types when defining the schema:

class UserTable(tag: Tag) extends Table[User](tag, "user") {
    def id = column[Long]("id", O.PrimaryKey, O.AutoInc)
    def firstName = column[String]("first_name")
    def lastName = column[String]("last_name")

    def * = (id.?, firstName, lastName) <> (User.tupled, User.unapply)
}

Nice. Like in ancient times. Forget about @Column auto-mapping. If you have a DTO and need to add a field, you have to remember to add it in three places: DTO, DAO and schema.

And have you seen the insert method implementation?

def create(name: String, age: Int): Future[Person] = db.run {
  (people.map(p => (p.name, p.age))
    returning people.map(_.id)
    into ((nameAge, id) => Person(id, nameAge._1, nameAge._2))
  ) += (name, age)
}

I am used to having a save method defined only once, somewhere in an abstract repository, as a one-liner: something like myFavouriteOrm.insert(new User(name, age)).

Full example is here: https://github.com/playframework/play-scala-slick-example

I don’t understand why Play’s authors say ORMs “will quickly become counter-productive“. Writing manual mappings on real projects becomes a pain much faster than the abstract “ORM counter-productivity“.

Code generation

The second approach is code generation. It scans your DB and generates code based on it – like a reversed migration. I didn’t like this approach at all (it was what I found in the legacy code).

First, to make it work you need DB access at compile time, which is not always possible.

Second, if the backend owns the data, it should be responsible for the schema. That means the schema should come from the code, or code changes plus a migration with the schema changes should live in the same repository.

Third, have you seen the generated code? Lots of unnecessary classes, no formatting (400–600 characters per line), no way to modify these classes by adding some logic or extending an interface. I had to create my own data layer around this generated data layer 🙁

Ebean and some efforts to make it work

So, after fighting with Slick I decided to remove it, together with the data layer, completely and to use another technology. I selected Ebean, as it is the official ORM for Play2 + Java. It looks like the Play developers don’t like Hibernate for some reason.

An important thing to notice: it is a Java ORM, and Scala is not supported officially (its Scala support was dropped a few years ago), so you need to apply some effort to make it work.

First of all – add the JAXB libraries to your dependencies. They are no longer available by default since Java 9 (and were removed from the JDK entirely in Java 11), so on Java 9+ your app will crash at runtime without them.

libraryDependencies ++= Seq(
  "javax.xml.bind" % "jaxb-api" % "2.2.11",
  "com.sun.xml.bind" % "jaxb-core" % "2.2.11",
  "com.sun.xml.bind" % "jaxb-impl" % "2.2.11",
  "javax.activation" % "activation" % "1.1.1"
)

Next – do not forget to add the JDBC library and the driver library for your database.
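
For example, a plausible build.sbt sketch for a Play + PostgreSQL setup (the driver version is illustrative; the exact artifacts depend on your Play/Ebean setup):

// build.sbt (sketch) – assumes PostgreSQL; artifact versions are examples only
libraryDependencies ++= Seq(
  jdbc,                                        // Play's JDBC support (from the Play sbt plugin)
  "org.postgresql" % "postgresql" % "42.2.5"   // JDBC driver for the database
)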

After that you are ready to set up your data layer.

Entity

Write your entities as normal Java entities:

import java.util
import javax.persistence._

@Table(name = "master")
@Entity
class Master {
  @Id
  @GeneratedValue(strategy = GenerationType.AUTO)
  @Column(name = "master_id")
  var masterId: Int = _

  @Column(name = "master_name")
  var masterName: String = _

  @OneToMany(cascade = Array(CascadeType.MERGE))
  var pets: util.List[Pet] = new util.ArrayList[Pet]()
}

Basic Scala types are supported, but with several limitations:

  • You have to use java.util.List for one-to-many / many-to-many relationships. Scala’s ListBuffer is not supported, as Ebean doesn’t know how to de/serialize it. Scala’s List isn’t either, as it is immutable and Ebean can’t populate it.
  • Primitives like Int or Double must not be nullable in the database. If a column is nullable, use java.lang.Double (or java.lang.Integer), otherwise you will get an exception as soon as you try to load such an object from the database, because Scala’s Double is compiled to the double primitive, which can’t be null. Scala’s Option[Double] won’t work either, as the ORM would return null instead of an Option.
  • Relations are supported, including the bridge table, which is also created automatically. But, because of a bug, @JoinColumn can’t be specified.
  • Ebean uses Java lists, so you need scala.collection.JavaConverters every time you pass a list into a query (like where().in) and every time you get a list back (like findList) – see the sketch after this list.
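
A small sketch of those conversions (MasterQueries and findByIds are illustrative names I made up; master_id is the column from the entity above):

import io.ebean.EbeanServer

import scala.collection.JavaConverters._

object MasterQueries {
  // illustrative finder: pass a Scala list into an Ebean query and convert the result back
  def findByIds(ebeanServer: EbeanServer, ids: List[Int]): List[Master] =
    ebeanServer.find(classOf[Master])
      .where().in("master_id", ids.asJava)   // Scala List -> java.util collection for the query
      .findList()                            // returns java.util.List[Master]
      .asScala.toList                        // back to a Scala List for the caller
}
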
Repository

This is the one nice thing in Scala that is useful here: a trait can extend an abstract class. That means you can create your abstract CRUD repository and reuse it in business repositories – like you get out of the box in Spring Data 🙂

1. Create your abstract repository:

import javax.inject.Inject

import io.ebean.{Ebean, EbeanServer}   // in older Ebean versions the package is com.avaje.ebean
import play.db.ebean.EbeanConfig

import scala.collection.JavaConverters._
import scala.reflect.{ClassTag, classTag}

class AbstractRepository[T: ClassTag] {
  var ebeanServer: EbeanServer = _

  @Inject()
  def setEbeanServer(ebeanConfig: EbeanConfig): Unit = {
    ebeanServer = Ebean.getServer(ebeanConfig.defaultServer())
  }

  def insert(item: T): T = {
    ebeanServer.insert(item)
    item
  }

  def update(item: T): T = {
    ebeanServer.update(item)
    item
  }

  def saveAll(items: List[T]): Unit = {
    ebeanServer.insertAll(items.asJavaCollection)
  }

  def listAll(): List[T] = {
    ebeanServer.find(classTag[T].runtimeClass.asInstanceOf[Class[T]])
      .where().findList().asScala.toList
  }

  def find(id: Any): Option[T] = {
    Option(ebeanServer.find(classTag[T].runtimeClass.asInstanceOf[Class[T]], id))
  }
}

You need to use classTag here to determine the class of the entity.

2. Create your business repository trait, extending this abstract repository:

@ImplementedBy(classOf[MasterRepositoryImpl])
trait MasterRepository extends AbstractRepository[Master] {
}

Here you can also set up some special methods, which will be used only in this repository.

In the implementation you only need to define the methods from MasterRepository. If there are none, just leave it empty; the methods from AbstractRepository will be accessible anyway.

@Singleton
class MasterRepositoryImpl extends MasterRepository {
}
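
With that in place, using the repository from a service looks roughly like this (MasterService is a hypothetical name; the point is that the generic CRUD methods from AbstractRepository are available on the injected trait):

import javax.inject.{Inject, Singleton}

// hypothetical service using the repository
@Singleton
class MasterService @Inject()(masterRepository: MasterRepository) {

  def registerMaster(master: Master): Master =
    masterRepository.insert(master)   // from AbstractRepository

  def allMasters(): List[Master] =
    masterRepository.listAll()        // from AbstractRepository
}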

After the data layer refactoring, ~70% of the code was removed. The main point here: functional stuff (FRM and other “modern” things) can be useful only when you don’t have business objects. E.g. you are creating a telecom back-end whose main intent is to parse network packets, do something with their data and fire them to the next point of your data pipeline. In all other cases, when your business logic touches the real world, you need object-oriented design.

Bugs and workarounds

I’ve recently faced a bug, which I would like to mention.

Sometimes the application fails to start because it can’t find an Ebean class. It is connected with logback.xml, but I am not sure how. My breaking change was adding Sentry‘s logback appender.

There are two solutions:

  • some people fix it by playing with logback.xml – removing or changing appenders. That doesn’t look very stable.
  • another workaround is to inject EbeanDynamicEvolutions into your repository (AbstractRepository is the best place), as shown below. You don’t need to actually use it. I think it is connected with Play’s attempts to run evolutions on start; the connection to logback is still unclear.
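
A minimal sketch of that workaround (assuming the play-ebean module, where EbeanDynamicEvolutions lives in play.db.ebean; the injected value is never actually used):

import javax.inject.Inject
import play.db.ebean.EbeanDynamicEvolutions

import scala.reflect.ClassTag

class AbstractRepository[T: ClassTag] {
  // workaround: injecting EbeanDynamicEvolutions forces Play to initialise evolutions
  // (and the Ebean server) before the repository is used; the value itself is never touched
  private var dynamicEvolutions: EbeanDynamicEvolutions = _

  @Inject()
  def setEbeanDynamicEvolutions(evolutions: EbeanDynamicEvolutions): Unit = {
    dynamicEvolutions = evolutions
  }

  // ... the rest of the repository as defined above
}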

DTO layer

Another part of the system which disappointed me. This layer’s intent is to receive messages from outside (usually REST) and run actions based on the message type. Usually that means you get a message, parse it (usually from JSON) and pass it to the service layer, then take the service layer’s result and send it back as an encoded answer. Encoding and decoding messages (DTOs) is the main job here.

For some reason, working with JSON is unfriendly in Scala. And super unfriendly in Play2.

Json deserialization – not automated anymore

In normal frameworks, specifying the type of the object to be parsed is all you need to do. You specify the root object, and the request body is parsed and deserialized into it, including all sub-objects. E.g. build(@RequestBody RepositoryDTO body), taken from one of my open-source projects.

In Play you need to set up an implicit reader for every sub-object used in your DTO. If your MasterDTO contains a PetDTO, which contains a RoleDTO, you have to set up readers for all of them:

def createMaster: Action[AnyContent] = Action.async { request =>
    implicit val formatRole: OFormat[RoleDTO] = Json.format[RoleDTO]
    implicit val formatPet: OFormat[PetDTO] = Json.format[PetDTO]
    implicit val format: OFormat[MasterDTO] = Json.format[MasterDTO]
    val parsed = Json.fromJson(request.body.asJson.get)(format)
    val body: MasterDTO = parsed.getOrElse(null)
    // …
}

Maybe there is an automated way, but I haven’t found it. All approaches end up with getting the request body as JSON and parsing it manually.

Finally I ended up with json4s, parsing objects like this:

JsonMethods.parse(request.body.asJson.get.toString()).extract[MasterDTO]

What I still don’t like here is that you have to get the body as JSON, convert it to a string and parse it one more time. I am lucky this project is not realtime, but if yours is – think twice before doing so.
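
For completeness, extract[T] needs implicit json4s Formats in scope; a minimal sketch of the call site (JsonParsing and parseMaster are illustrative names, and json4s-jackson is assumed) could look like this:

import org.json4s.{DefaultFormats, Formats}
import org.json4s.jackson.JsonMethods
import play.api.mvc.{AnyContent, Request}

object JsonParsing {
  // extract[T] needs implicit json4s Formats in scope
  implicit val formats: Formats = DefaultFormats

  // hypothetical helper: request body -> DTO (fails if the body is not valid JSON)
  def parseMaster(request: Request[AnyContent]): MasterDTO =
    JsonMethods.parse(request.body.asJson.get.toString()).extract[MasterDTO]
}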

Json validation – more boilerplate for the god of boilerplate!

Play has its own modern functional way of data validation. In three steps only:

  1. Forget about javax.validation
  2. Define your DTO as a case class. Here you write your field names and their types.
  3. Manually write a Form mapping, mentioning all the DTO’s field names and writing their types once again.

After Slick’s manual schema definition, I expected something shitty, but this exceeded my expectations.

The example:

case class SomeDTO(id: Int, text: String, option: Option[Double])

def validationForm: Form[SomeDTO] = {
  import play.api.data.Forms._
  Form(
    mapping(
      "id" -> number,
      "text" -> nonEmptyText,
      "option" -> optional(of(doubleFormat))
    )(SomeDTO.apply)(SomeDTO.unapply)
  )
}

It is used like this:

    def failure(badForm: Form[_]) = {
      BadRequest(badForm.errorsAsJson(messagesProvider))
    }

    def success(input: SomeDTO) = {
      // your business logic here 
    }

    validationForm.bindFromRequest()(request).fold(failure, success)

Json serialization – forget about heterogeneity

This was the main problem with Play’s JSON implementation and the main reason I decided to get rid of it. Unfortunately, I haven’t found a quick way to remove it completely (it looks like it is hardcoded) and replace it with json4s.

All my DTOs implement my JsonSerializable trait, and I have a few services which work with generic objects. Imagine DogDTO and CatDTO: they are different business entities, but some actions are common. To avoid code duplication I just pass them via the Pet trait to those services (like FeedPetService). They do their job and return just a List of JsonSerializable objects (either Cat or Dog DTOs, based on the input type).

It turned out that Play can’t serialize a trait if it is not sealed: it requires an implicit writer to be set up explicitly. So after googling a bit I switched to json4s.

Now I have a 2-line implementation for any DTO:

def toJson(elements: List[JsonSerializable]): String = {
  implicit val formats: AnyRef with Formats = Serialization.formats(NoTypeHints)
  Serialization.write(elements)
}

It is defined in a trait; every companion object which extends this trait gets JSON serialization of its class objects out of the box.
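
A sketch of how that wiring might look (JsonSerializationSupport, CatDTO and its companion are illustrative names; the toJson method above lives in the shared trait):

import org.json4s.{Formats, NoTypeHints}
import org.json4s.jackson.Serialization

trait JsonSerializable

// shared trait mixed into companion objects to provide JSON serialization
trait JsonSerializationSupport {
  def toJson(elements: List[JsonSerializable]): String = {
    implicit val formats: Formats = Serialization.formats(NoTypeHints)
    Serialization.write(elements)
  }
}

case class CatDTO(name: String) extends JsonSerializable

object CatDTO extends JsonSerializationSupport

// usage: CatDTO.toJson(List(CatDTO("Tom"), CatDTO("Felix")))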

Summing up

  • Slick’s creators call Slick a “Functional Relational Mapper” (FRM) and claim it has minimal-configuration advantages. As far as I can see, it is yet another unsuccessful attempt to create something with the “Functional” buzzword. In my 10 years of experience I spent around 4 years in functional programming (Erlang) and saw a lot of dead projects which started as a “New Innovative Functional Approach”.
  • Scala’s implicits are something magical that breaks the KISS principle and makes the code messy. There is a very good thread about Scala implicits + Slick.
  • Working with JSON in Play2 is a pain.